Lecture 4: Basic Nonparametric Estimation
Instructor: Han Hong
Department of Economics, Stanford University
2011
Basic View
• There can be many meanings to “nonparametrics”.
• One meaning is optimization over a set of functions.
• For example, given the sample of observations $x_1, \ldots, x_n$, find a distribution function under which the joint probability of $x_1, \ldots, x_n$ is maximized.
• This is also called “nonparametric maximum likelihood”.
• The meaning of “nonparametric” for now is density estimation and estimation of conditional expectations.
Density Estimate: Motivation
• One motivation is to first use the histogram to estimate the density:
$$
\hat f(x) = \frac{1}{2h} \cdot \frac{\#\{x_i \in (x - h, x + h)\}}{n}
= \frac{1}{2h} \cdot \frac{1}{n} \sum_{i=1}^{n} 1(x - h \le x_i \le x + h)
= \frac{1}{nh} \sum_{i=1}^{n} \frac{1}{2}\, 1\!\left(\frac{|x - x_i|}{h} \le 1\right)
$$
• $\frac{1}{2}\, 1(|u| \le 1)$ is the uniform density over $(-1, 1)$, called the uniform kernel.
• Generally, use any other density function $K(\cdot)$ to get
$$
\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right).
$$
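A minimal sketch of this estimator in Python (my own illustration, not from the slides; the data and bandwidth are arbitrary choices):

```python
import numpy as np

def uniform_kernel(u):
    # the uniform kernel: (1/2) 1(|u| <= 1)
    return 0.5 * (np.abs(u) <= 1)

def gaussian_kernel(u):
    # the standard normal density
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(grid, data, h, K=gaussian_kernel):
    # f_hat(x) = (1/(n h)) sum_i K((x - x_i) / h), evaluated on a grid
    u = (grid[:, None] - data[None, :]) / h
    return K(u).mean(axis=1) / h

rng = np.random.default_rng(0)
data = rng.normal(size=500)        # the sample x_1, ..., x_n
grid = np.linspace(-3, 3, 61)
f_hat = kde(grid, data, h=0.3)     # h = 0.3 is an arbitrary small bandwidth
print(f_hat[30])                   # estimate at x = 0; true value is about 0.399
```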
• Another motivation is to estimate the distribution function $F(x)$ by
$$
\hat F(x) = \frac{1}{n} \sum_{i=1}^{n} 1(x_i \le x),
$$
but you cannot differentiate it to get the density.
• Replace $1(x_i \le x)$ by $G\!\left(\frac{x - x_i}{h}\right)$, where $G(\cdot)$ is any smooth distribution function ($G(\infty) = 1$, $G(-\infty) = 0$), and let $h \to 0$.
• In practice, take h as some small but fixed number, like 0.1.
• So let $K(\cdot) = G'(\cdot)$ and differentiate $\hat F(x)$ to get
$$
\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right), \quad \text{or} \quad \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right) \text{ if } x \in \mathbb{R}^d
$$
(for a symmetric kernel the sign of the argument is immaterial, so this matches the earlier formula).
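A quick numerical sketch of this motivation (my own illustration, with $G = \Phi$, the standard normal CDF, so $K = \varphi$): the derivative of the smoothed empirical CDF is exactly the kernel density estimate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(size=300)
h = 0.1   # a small fixed bandwidth, as on the slide

def F_hat(x):
    # smoothed empirical CDF: (1/n) sum_i G((x - x_i) / h), with G = Phi
    return norm.cdf((x - data) / h).mean()

def f_hat(x):
    # its exact derivative: (1/(n h)) sum_i K((x - x_i) / h), with K = G'
    return norm.pdf((x - data) / h).mean() / h

# a numerical derivative of F_hat agrees with f_hat
x0, eps = 0.5, 1e-5
print((F_hat(x0 + eps) - F_hat(x0 - eps)) / (2 * eps), f_hat(x0))
```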
Conditional Expectation: Motivation
• Estimate $E(y|x)$, or more generally $E(g(y)|x)$ for some function $g(\cdot)$, or things like conditional quantiles.
• Local weighting: use observations $x_i$ close to $x$.
• Take a neighborhood $\mathcal{N}$ around $x$; the size of $\mathcal{N}$ should shrink to 0, but not too fast.
• Average over those $y_i$ for which $x_i \in \mathcal{N}$.
• More generally, give more weight to those $y_i$ whose $x_i$ is close to $x$, and less weight to those $y_i$ whose $x_i$ is far away from $x$ (a minimal sketch follows this list).
• For weights $W_n(x, x_i)$ such that
(1) $\sum_{i=1}^{n} W_n(x, x_i) = 1$, (2) $W_n(x, x_i) \to 0$ if $x_i \ne x$, (3) $\max_{1 \le i \le n} |W_n(x, x_i)| \to 0$ as $n \to \infty$,
estimate $E(y|x)$ by $\sum_{i=1}^{n} W_n(x, x_i) Y_i$.
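The simplest weights satisfying (1)-(3) put equal mass on a shrinking neighborhood; a minimal sketch (my illustration, with simulated data):

```python
import numpy as np

def local_average(x0, x, y, h):
    # equal weights W_n(x0, x_i) = 1/#N on the neighborhood |x_i - x0| <= h,
    # zero outside; the weights sum to 1 as required
    in_window = np.abs(x - x0) <= h
    return y[in_window].mean() if in_window.any() else np.nan

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=1000)
y = np.sin(x) + 0.3 * rng.normal(size=1000)   # E(y|x) = sin(x)
print(local_average(0.5, x, y, h=0.2), np.sin(0.5))
```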
Classification
• Anything you do parametrically becomes “nonparametric” if you do it only for $x_i$ close to $x$.
• Local nonparametric estimates:
• kernel smoothing
• k-nearest neighbors (k-NN)
• local polynomials
• Global nonparametric estimates:
• series (sieve)
• splines
• The focus today is kernel smoothing.
Kernel Smoothing
• Use density weighting for the weights $W_n(x, x_i)$ to get the kernel estimator of $E(y|x)$.
• If $x_i$ is one-dimensional, let
$$
W_n(x, x_i) = \frac{\frac{1}{nh}\, K\!\left(\frac{x - x_i}{h}\right)}{\frac{1}{nh} \sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right)}, \quad \text{satisfying} \quad \sum_{i=1}^{n} W_n(x, x_i) = 1.
$$
• The kernel estimator of $E(y|x)$ is
$$
\sum_{i=1}^{n} W_n(x, x_i)\, Y_i = \frac{\frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) Y_i}{\frac{1}{nh} \sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right)}.
$$
• If $x_i \in \mathbb{R}^d$, use a multidimensional density function and replace $h$ with $h^d$.
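This ratio is the Nadaraya-Watson estimator; a minimal sketch (my illustration; the common $\frac{1}{nh}$ factors cancel, so unnormalized kernel weights suffice):

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    # kernel-weighted average: sum_i W_n(x0, x_i) * y_i, with
    # W_n(x0, x_i) = K((x0 - x_i)/h) / sum_j K((x0 - x_j)/h)
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)  # Gaussian kernel, constants cancel
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=500)
y = x**2 + 0.1 * rng.normal(size=500)   # E(y|x) = x^2
print(nadaraya_watson(0.5, x, y, h=0.05), 0.25)
```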
Another View of Kernel Estimator
• Estimate $\gamma(x)$ and $f(x)$ separately in
$$
E(y|x) = \frac{E(y|x)\, f(x)}{f(x)} = \frac{\int y f(y, x)\, dy}{f(x)} = \frac{\gamma(x)}{f(x)}.
$$
• $\hat f(x) = \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$.
• For $\hat\gamma(x)$, plug
$$
\hat f(x, y) = \frac{1}{nh^{d+1}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \bar K\!\left(\frac{y_i - y}{h}\right)
$$
into $\int y f(y, x)\, dy$, and substitute $u = (y - y_i)/h$:
$$
\int y \hat f(y, x)\, dy = \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \int y\, \frac{1}{h}\, \bar K\!\left(\frac{y_i - y}{h}\right) dy
= \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \int (y_i + uh)\, \bar K(u)\, du
= \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) y_i,
$$
using $\int \bar K(u)\, du = 1$ and $\int u\, \bar K(u)\, du = 0$ for a symmetric $\bar K$.
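A numeric sanity check of the last step, $\int (y_i + uh)\,\bar K(u)\,du = y_i$ (my own verification, with a Gaussian $\bar K$ and arbitrary $y_i$, $h$):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

y_i, h = 2.7, 0.3   # arbitrary values
# integrate (y_i + u*h) * K_bar(u) over u; the result is y_i, since K_bar
# integrates to 1 and, being symmetric, has mean zero
val, _ = quad(lambda u: (y_i + u * h) * norm.pdf(u), -np.inf, np.inf)
print(val)   # approximately 2.7
```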
• Another view of $\hat\gamma(x)$: think of $\int y \hat f(y, x)\, dy$ as $\int y\, dP$, where $P$ is the measure over $y$ defined by
$$
P(y_i \le y, x_i = x) = \frac{d}{dx}\, P(y_i \le y, x_i \le x)
\overset{\text{estimate}}{=} \frac{d}{dx}\, \frac{1}{n} \sum_{i=1}^{n} 1(y_i \le y)\, G\!\left(\frac{x_i - x}{h}\right)
= \frac{1}{nh^d} \sum_{i=1}^{n} 1(y_i \le y)\, K\!\left(\frac{x_i - x}{h}\right)
$$
• Plug this estimate of $P$ into $\int y\, dP$:
$$
\int y\, d\hat P = \int y\, d\!\left[\frac{1}{nh^d} \sum_{i=1}^{n} 1(y_i \le y)\, K\!\left(\frac{x_i - x}{h}\right)\right]
= \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right) \int y\, d1(y_i \le y)
= \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right) y_i
$$
Note
• Only need to study $\hat\gamma(x)$, since $\hat f(x)$ is just the special case of $\hat\gamma(x)$ with $y_i \equiv 1$.
• Convenient forms of the kernel (density) function:
• Uniform kernel: $\frac{1}{2}\, 1(|u| \le 1)$;
• Triangular kernel: $(1 - |u|)\, 1(|u| \le 1)$;
• Quartic, Epanechnikov, Gaussian, etc.
• Estimating derivatives: as long as the kernel is smoothly differentiable, simply differentiate $\hat\gamma(x)$ (up to the factor $(-1)^k$ from the chain rule):
$$
\hat\gamma^{(k)}(x) = \frac{1}{nh^{k+d}} \sum_{i=1}^{n} K^{(k)}\!\left(\frac{x_i - x}{h}\right) y_i
$$
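A small sketch of derivative estimation for $d = 1$, $k = 1$ (my illustration with the Gaussian kernel, whose derivative is $K'(u) = -u\,K(u)$):

```python
import numpy as np

def density_derivative(x0, data, h):
    # f_hat'(x0) = (1/(n h^2)) sum_i K'((x0 - x_i)/h), with Gaussian K
    u = (x0 - data) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return np.mean(-u * K) / h**2

rng = np.random.default_rng(4)
data = rng.normal(size=2000)
x0 = 1.0
# true f'(x0) for the standard normal: -x0 * phi(x0)
true = -x0 * np.exp(-0.5 * x0**2) / np.sqrt(2 * np.pi)
print(density_derivative(x0, data, h=0.3), true)
```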
k-NN and Local Polynomials
• The other two major weighting schemes for $W_{ni}(x)$.
• k-nearest neighbors (k-NN)
• Use the $k$ closest neighbors of the point $x$ instead of a fixed window.
• Weight these $k$ neighbors equally or according to their distances, for example with any kernel density weight $K(\cdot)$.
• Local polynomial
• Run a $k$th-order polynomial regression using the observations with $|x_i - x| \le h$ (see the sketch after this list).
• The degree $k$ corresponds to the order of the kernel.
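A minimal local linear ($k = 1$) sketch with a uniform window (my illustration; the intercept of the local fit estimates $E(y|x_0)$):

```python
import numpy as np

def local_polynomial(x0, x, y, h, k=1):
    # fit a k-th order polynomial in (x - x0) by least squares,
    # using only observations with |x_i - x0| <= h;
    # the intercept estimates E(y|x0), the slope its derivative, etc.
    mask = np.abs(x - x0) <= h
    X = np.vander(x[mask] - x0, N=k + 1, increasing=True)  # columns 1, (x-x0), ...
    coef, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    return coef[0]

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=1000)
y = np.exp(x) + 0.2 * rng.normal(size=1000)   # E(y|x) = exp(x)
print(local_polynomial(0.3, x, y, h=0.15), np.exp(0.3))
```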
Series and Splines
• Series (sieve)
• The only difference between series and local polynomials is that you run the polynomial regression using all observations, instead of only a shrinking neighborhood $(x - h, x + h)$ (see the sketch below).
• Instead of fixing $k$, let $k \to \infty$.
• Instead of polynomials, you can use a family of orthogonal series of functions, like trigonometric functions, etc.
• Splines
• Find a twice differentiable function $g(x)$ that minimizes $\sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int g''(x)^2\, dx$, for some $\lambda > 0$.
• The term $\lambda \int g''(x)^2\, dx$ penalizes the roughness of the estimate $g$.
• The minimizer is a piecewise cubic polynomial (a cubic spline) with continuous second derivatives.
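A minimal sketch of the series idea (my illustration; the degree rule $k \approx n^{1/3}$ is an arbitrary placeholder for the required $k \to \infty$):

```python
import numpy as np

def series_estimator(x0, x, y, k):
    # regress y on the polynomial basis 1, x, ..., x^k using ALL observations,
    # then evaluate the fitted function at x0
    X = np.vander(x, N=k + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return (np.vander(np.atleast_1d(x0), N=k + 1, increasing=True) @ coef)[0]

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, size=2000)
y = np.sin(3 * x) + 0.2 * rng.normal(size=2000)   # E(y|x) = sin(3x)
k = int(len(x) ** (1 / 3))     # let k grow slowly with n
print(series_estimator(0.5, x, y, k), np.sin(1.5))
```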
Optimal Rate of Convergence for Nonparametric Estimates
• Curse of dimensionality: for a given bandwidth (window size), the higher the dimension of $x$, the less data there is in a neighborhood of width $h$.
• If both $h \to 0$ and $nh^d \to \infty$, then the estimate is consistent.
• How about the speed at which the estimator converges?
• Conclusion: suppose the true function $\gamma(x)$ is $p$ times differentiable, with all $p$th derivatives bounded uniformly over $x$. Then the optimal bandwidth is $h_{opt} = n^{-\frac{1}{2p+d}}$, and the best rate at which $\hat\gamma(x)$ can approach $\gamma(x)$ is $O_p\!\left(n^{-\frac{p}{2p+d}}\right)$.
• The problem here is the bias-variance trade-off.
• The smaller the $h$, the smaller the bias, but the fewer observations you have, and thus the larger the variance.
• Criterion: total error = bias + estimation error, or MSE.
• The bias is $O_p(h^p)$.
• Use the $p$ bounded derivatives condition and a Taylor expansion.
• The variation is $O_p\!\left(\frac{1}{\sqrt{nh^d}}\right)$.
• Think of $\bar x - \mu = O_p\!\left(\frac{1}{\sqrt{n}}\right)$, with $nh^d$ playing the role of $n$.
• The total error is $O_p\!\left(h^p + \frac{1}{\sqrt{nh^d}}\right)$.
• Find an $h$ to minimize the total error:
$$
h_{opt} = O\!\left(n^{-\frac{1}{2p+d}}\right).
$$
• Then the (pointwise) optimal rate of convergence is
$$
O\!\left(h_{opt}^{p}\right) = O\!\left(\frac{1}{\sqrt{nh_{opt}^d}}\right) = O\!\left(n^{-\frac{p}{2p+d}}\right).
$$
• It is not possible to have $\sqrt{n}$ convergence for nonparametric estimates, since $\frac{p}{2p+d} < \frac{1}{2}$.
• Sometimes an $n^{1/4}$ rate of convergence is needed to get rid of the second-order terms in semiparametric estimators, which means $p > d/2$.
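A quick numerical check of this tradeoff (my own illustration): minimize $h^p + (nh^d)^{-1/2}$ over $h$ and compare the minimizer with $n^{-1/(2p+d)}$; the ratio settles to a constant.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def total_error(h, n, p, d):
    # bias of order h^p plus noise of order 1/sqrt(n h^d), constants set to 1
    return h**p + 1.0 / np.sqrt(n * h**d)

p, d = 2, 1
for n in [10**3, 10**5, 10**7]:
    res = minimize_scalar(total_error, bounds=(1e-8, 1.0),
                          method="bounded", args=(n, p, d))
    print(n, res.x / n ** (-1 / (2 * p + d)))   # ratio settles to a constant
```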
Optimal Rate for Derivative Estimates
• The optimal bandwidth for $\hat\gamma^{(k)}(x)$ is of the same order as that for estimating $\gamma(x)$ itself.
• The bias is $O_p(h^{p-k})$, and the variation is $O_p\!\left(\frac{1}{h^k\sqrt{nh^d}}\right)$.
• The total error is $O_p\!\left(h^{p-k} + \frac{1}{h^k\sqrt{nh^d}}\right)$.
• Find an $h$ to minimize this again:
$$
h_{opt} = n^{-\frac{1}{2p+d}}.
$$
• Then the best convergence rate is
$$
O_p\!\left(h_{opt}^{p-k}\right) = O_p\!\left(\frac{1}{h_{opt}^k\sqrt{nh_{opt}^d}}\right) = O_p\!\left(n^{-\frac{p-k}{2p+d}}\right).
$$
Higher Order Kernels
• A kernel of order $r$ is a $K(\cdot)$ for which
$$
\int K(u)\, du = 1, \qquad \int K(u)\, u^q\, du = 0 \ \ \forall q = 1, \ldots, r - 1, \qquad \int |u^r K(u)|\, du < \infty.
$$
• Bias of kernel estimates $= E\hat\gamma(x) - \gamma(x)$:
$$
E\hat\gamma(x) = E\, \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) Y_i
= \int \frac{1}{h^d}\, K\!\left(\frac{x - x_i}{h}\right) E(y_i|x_i)\, f(x_i)\, dx_i
= \int \frac{1}{h^d}\, K\!\left(\frac{x - x_i}{h}\right) \gamma(x_i)\, dx_i
= \int K(u)\, \gamma(x + uh)\, du
$$
$$
= \gamma(x) + \sum_{j=1}^{r-1} \frac{h^j \gamma^{(j)}(x)}{j!} \int u^j K(u)\, du + h^r\, \frac{1}{r!} \int \gamma^{(r)}(x^*)\, u^r K(u)\, du
$$
• If $\gamma(x)$ has $p$ bounded derivatives and the kernel is of order $r$, then the bias is $O\!\left(h^{\min(p,r)}\right)$.
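For illustration (my example, not from the slides): a standard fourth-order kernel built from the Gaussian density is $K_4(u) = \frac{1}{2}(3 - u^2)\varphi(u)$; the sketch below verifies the defining moment conditions numerically.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def K4(u):
    # Gaussian-based kernel of order r = 4: integrates to 1,
    # and its first three moments vanish
    return 0.5 * (3 - u**2) * norm.pdf(u)

for q in range(4):
    val, _ = quad(lambda u: u**q * K4(u), -np.inf, np.inf)
    print(q, round(val, 10))   # prints 1, 0, 0, 0
```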
• Variance of kernel estimates:
$$
\operatorname{Var}(\hat\gamma(x)) = \frac{1}{n^2h^{2d}} \sum_{i=1}^{n} \operatorname{Var}\!\left(K\!\left(\frac{x - x_i}{h}\right) Y_i\right)
= \frac{1}{nh^{2d}}\, E\!\left(K^2\!\left(\frac{x - x_i}{h}\right) Y_i^2\right) - \frac{1}{nh^{2d}} \left(E\, K\!\left(\frac{x - x_i}{h}\right) Y_i\right)^2
$$
$$
= \frac{1}{nh^d} \int \frac{1}{h^d}\, K^2\!\left(\frac{x - x_i}{h}\right) E(y_i^2|x_i)\, f(x_i)\, dx_i - \frac{1}{n} \left(E\, \frac{1}{h^d}\, K\!\left(\frac{x - x_i}{h}\right) Y_i\right)^2
$$
$$
= \frac{1}{nh^d} \int \frac{1}{h^d}\, K^2\!\left(\frac{x - x_i}{h}\right) g(x_i)\, dx_i + O\!\left(\frac{1}{n}\right)
= \frac{1}{nh^d} \int K^2(u)\, g(x + uh)\, du + O\!\left(\frac{1}{n}\right)
$$
$$
= \frac{1}{nh^d} \int K^2(u)\, du\; g(x) + \frac{h}{nh^d} \int K^2(u)\, g'(x^*)\, u\, du + O\!\left(\frac{1}{n}\right)
= \frac{1}{nh^d} \int K^2(u)\, du\; g(x) + O\!\left(\frac{h}{nh^d}\right) + O\!\left(\frac{1}{n}\right) = O\!\left(\frac{1}{nh^d}\right),
$$
where $g(x) = E(y^2|x)\, f(x)$.
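A Monte Carlo check of the $O\!\left(\frac{1}{nh^d}\right)$ variance (my illustration for the density estimate, $y_i \equiv 1$, $d = 1$, Gaussian kernel, so $\int K^2(u)\,du = \frac{1}{2\sqrt{\pi}}$):

```python
import numpy as np

rng = np.random.default_rng(7)
x0, h, n, reps = 0.0, 0.2, 500, 2000
est = np.empty(reps)
for r in range(reps):
    data = rng.normal(size=n)
    u = (x0 - data) / h
    est[r] = np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

# leading term of the theory: Var ~ f(x0) * Int(K^2) / (n h)
f0 = 1 / np.sqrt(2 * np.pi)                       # standard normal density at 0
print(est.var(), f0 / (2 * np.sqrt(np.pi)) / (n * h))   # approximately equal
```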
Asymptotic Distribution, Confidence Band
• If we use $h \sim h_{opt}$, the asymptotic distribution will depend on both the bias and the variance.
• If we use $h \ll h_{opt}$, i.e., $\frac{h}{h_{opt}} \to 0$, the asymptotic distribution has no bias in it, but the convergence rate is not the fastest.
• Example: consider $d = 1$, $r = 2$ (so effectively $p = 2$); then $h_{opt} = n^{-\frac{1}{2p+d}} = n^{-\frac{1}{5}}$.
• Find the asymptotic distribution of
$$
\sqrt{nh_{opt}}\, (\hat m(x) - m(x)) = h_{opt}^{-2}\, (\hat m(x) - m(x)), \qquad \text{for } \hat m(x) = \frac{\hat\gamma(x)}{\hat f(x)}.
$$
Bias
• Linearization:
$$
\hat m(x) - m(x) \approx \frac{1}{f(x)}\, (\hat\gamma(x) - \gamma(x)) - \frac{\gamma(x)}{f(x)^2} \left(\hat f(x) - f(x)\right)
$$
• As seen above, $E\hat\gamma(x) - \gamma(x) = \frac{1}{2} h^2 \gamma''(x) \int u^2 K(u)\, du$.
• $E\hat f(x) - f(x) = \frac{1}{2} h^2 f''(x) \int u^2 K(u)\, du$, since $\gamma(x) = m(x) f(x)$ with $m(x) \equiv 1$ for the density estimate.
• Therefore,
$$
E\, h_{opt}^{-2}\, (\hat m(x) - m(x)) = \frac{1}{2} \left(\frac{\gamma''}{f} - \frac{m}{f}\, f''\right) \int u^2 K(u)\, du
= \frac{1}{2} \left(\frac{m''f + 2m'f' + mf''}{f} - \frac{m}{f}\, f''\right) \int u^2 K(u)\, du
= \frac{2m'(x) f'(x) + m''(x) f(x)}{2 f(x)} \int u^2 K(u)\, du.
$$
Variance
• As seen above, for $g(x) = E(y^2|x)\, f(x)$,
$$
\operatorname{Var}\!\left(\sqrt{nh}\, (\hat\gamma(x) - \gamma(x))\right) \to g(x) \int K^2(u)\, du.
$$
• $\operatorname{Var}\!\left(\sqrt{nh}\, (\hat f(x) - f(x))\right) \to f(x) \int K^2(u)\, du$, since for the density estimate, where $y \equiv 1$, $g(x) = f(x)$.
• The covariance between $\hat\gamma(x)$ and $\hat f(x)$:
$$
\operatorname{Cov}\!\left(\sqrt{nh}\, (\hat\gamma(x) - \gamma(x)),\ \sqrt{nh}\, (\hat f(x) - f(x))\right) \to \gamma(x) \int K^2(u)\, du.
$$
• Therefore, by the delta method,
$$
\operatorname{Var}\!\left(\sqrt{nh}\, (\hat m(x) - m(x))\right) = \operatorname{Var}\!\left(\sqrt{nh} \left(\frac{1}{f}\, \hat\gamma - \frac{m}{f}\, \hat f\right)\right)
= \left(\frac{1}{f^2}\, E(y^2|x)\, f - \frac{2}{f^2}\, m \gamma + \frac{m^2}{f^2}\, f\right) \int K^2(u)\, du
$$
$$
= \frac{1}{f(x)} \left(E(y^2|x) - m(x)^2\right) \int K^2(u)\, du = \frac{\sigma^2(x)}{f(x)} \int K^2(u)\, du
$$
• To summarize: with $h = h_{opt} = n^{-1/5}$,
$$
\sqrt{nh}\, (\hat m(x) - m(x)) \xrightarrow{d} N\!\left(\frac{m''(x) f(x) + 2m'(x) f'(x)}{2 f(x)} \int u^2 K(u)\, du,\ \ \frac{\sigma^2(x)}{f(x)} \int K^2(u)\, du\right)
$$
• If we use an undersmoothed bandwidth $h \ll n^{-1/5}$, say $h = n^{-1/4}$, then
$$
\sqrt{nh}\, (\hat m(x) - m(x)) \xrightarrow{d} N\!\left(0,\ \frac{\sigma^2(x)}{f(x)} \int K^2(u)\, du\right)
$$
• If we use $h_{opt}$ to draw the confidence interval around $\hat m(x)$, a consistent estimate of the bias term is needed.
• However, $\gamma''(x)$ can NOT be estimated consistently using $h_{opt}$ (with $p = 2$ and $k = 2$, the derivative rate $n^{-\frac{p-k}{2p+d}}$ above is $n^0$). Instead, use an oversmoothed bandwidth, say $g = n^{-1/6}$.
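A sketch of a pointwise confidence interval using undersmoothing (my illustration; $\hat f$ and $\hat\sigma^2$ are simple plug-in estimates, and the data are simulated):

```python
import numpy as np

def nw_confidence_interval(x0, x, y, z=1.96):
    n = len(x)
    h = n ** (-0.25)              # undersmoothing: h = n^(-1/4) << n^(-1/5)
    u = (x0 - x) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    m_hat = np.sum(K * y) / np.sum(K)                  # Nadaraya-Watson estimate
    f_hat = K.mean() / h                               # density estimate at x0
    s2_hat = np.sum(K * (y - m_hat) ** 2) / np.sum(K)  # plug-in sigma^2(x0)
    RK = 1 / (2 * np.sqrt(np.pi))                      # Int(K^2) for the Gaussian kernel
    se = np.sqrt(s2_hat * RK / (f_hat * n * h))
    return m_hat - z * se, m_hat + z * se

rng = np.random.default_rng(8)
x = rng.uniform(-1, 1, size=5000)
y = x**3 + 0.3 * rng.normal(size=5000)       # E(y|x) = x^3
print(nw_confidence_interval(0.2, x, y))     # should cover 0.008
```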
Automatic Bandwidth Selection
• Good fit of the estimate:
• Minimize $\sum_{i=1}^{n} (\hat m(x_i) - m(x_i))^2$.
• If we replace $m(x_i)$ with $y_i$, we get a perfect fit of 0, since as $h \to 0$, $\hat m(x_i) = y_i$.
• Another way to think about this:
$$
\sum_{i=1}^{n} (\hat m(x_i) - y_i)^2 = \sum_{i=1}^{n} (\hat m(x_i) - m(x_i) - \varepsilon_i)^2
= \underbrace{\sum_{i=1}^{n} (\hat m(x_i) - m(x_i))^2}_{\text{what we want}} + \underbrace{\sum_{i=1}^{n} \varepsilon_i^2}_{\text{unrelated}} - \underbrace{2 \sum_{i=1}^{n} (\hat m(x_i) - m(x_i))\, \varepsilon_i}_{\text{the trouble}}.
$$
• Expectation of the trouble term (only the $j = i$ terms survive, since $E\varepsilon_i\varepsilon_j = 0$ for $j \ne i$):
$$
E \sum_{i=1}^{n} \frac{1}{nh} \sum_{j=1}^{n} K\!\left(\frac{x_i - x_j}{h}\right) \varepsilon_j \varepsilon_i = \frac{1}{nh} \sum_{i=1}^{n} K(0)\, \sigma^2 = \frac{1}{h}\, \sigma^2 K(0)
$$
• Cross validation
• Leave-one-out estimate: $\hat m_{-i}(x_i) = \frac{1}{(n-1)h} \sum_{j \ne i} K\!\left(\frac{x_j - x_i}{h}\right) y_j$
• Minimize the cross-validation function (see the sketch below):
$$
CV(h) = \sum_{i=1}^{n} (\hat m_{-i}(x_i) - y_i)^2
$$
• Penalizing function
• A consistent estimate of the trouble term is $K(0)\, \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat m(x_i))^2$.
• Minimize the penalized criterion
$$
G(h) = \sum_{i=1}^{n} (\hat m(x_i) - y_i)^2 + 2 K(0)\, \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat m(x_i))^2
$$
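A minimal leave-one-out cross-validation sketch for the Nadaraya-Watson bandwidth (my illustration; zeroing the diagonal of the kernel matrix implements the leave-one-out fit):

```python
import numpy as np

def loo_cv(h, x, y):
    # CV(h) = sum_i (m_hat_{-i}(x_i) - y_i)^2 with Nadaraya-Watson weights
    u = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2)
    np.fill_diagonal(K, 0.0)            # leave observation i out of its own fit
    m_loo = K @ y / K.sum(axis=1)
    return np.sum((m_loo - y) ** 2)

rng = np.random.default_rng(9)
x = rng.uniform(0, 1, size=400)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=400)
grid = np.geomspace(0.005, 0.5, 30)     # candidate bandwidths
h_cv = grid[np.argmin([loo_cv(h, x, y) for h in grid])]
print(h_cv)
```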
Bias Reduction by Jackknifing
• It is essentially equivalent to a higher-order kernel.
• It makes no difference if you are just running a simple kernel regression.
• If the objective function is convex only with a positive $K(\cdot)$, say when running a nonparametric quantile regression, then operationally the jackknife method is very useful for preserving the convexity of the objective function.
Uniform rate of convergence
• It is useful to obtain the optimal bandwidth and the optimal uniform convergence rate, i.e., for $\sup_{x \in \mathcal{X}} |\hat\gamma(x) - \gamma(x)|$.
• Again, consider the bias-variance tradeoff.
• The bias $\sup_{x \in \mathcal{X}} |E\hat\gamma(x) - \gamma(x)|$ for an $r$th-order kernel is $O_P(h^p)$.
• The error $\sup_{x \in \mathcal{X}} |\hat\gamma(x) - E\hat\gamma(x)|$ is $O_p\!\left(\left(\frac{nh^d}{\log n}\right)^{-1/2}\right)$.
• Use the Bernstein inequality in the proof.
• Minimize the total error $O_P(h^p) + O_p\!\left(\left(\frac{nh^d}{\log n}\right)^{-1/2}\right)$: this gives $h_{opt} = O\!\left(\left(\frac{\log n}{n}\right)^{\frac{1}{2p+d}}\right)$ and the uniform rate $O_p\!\left(\left(\frac{\log n}{n}\right)^{\frac{p}{2p+d}}\right)$.