Lecture 4: Basic Nonparametric Estimation
Instructor: Han Hong
Department of Economics, Stanford University
2011
Basic View
• There can be many meanings to “nonparametrics”.
• One meaning is optimization over a set of functions.
• For example, given the sample of observations $x_1, \ldots, x_n$, find a distribution function under which the joint probability of $x_1, \ldots, x_n$ is maximized.
• This is also called “nonparametric maximum likelihood”.
• The meaning of “nonparametric” for now is density estimation and estimation of conditional expectations.
Density Estimate: Motivation
• One motivation is to first use the histogram to estimate the density:
$$
\hat f(x) = \frac{1}{2h} \cdot \frac{\#\{x_i \in (x - h, x + h)\}}{n}
= \frac{1}{2h} \cdot \frac{1}{n} \sum_{i=1}^{n} 1(x - h \le x_i \le x + h)
= \frac{1}{nh} \sum_{i=1}^{n} \frac{1}{2}\, 1\!\left(\frac{|x - x_i|}{h} \le 1\right)
$$
• $\frac{1}{2}\, 1(|u| \le 1)$ is the uniform density over $(-1, 1)$, called the uniform kernel.
• Generally, use any other density function $K(\cdot)$ to get
$$
\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right).
$$
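A minimal sketch of this estimator in Python (my own illustration, not from the slides; the data and bandwidth are arbitrary choices):

```python
import numpy as np

def uniform_kernel(u):
    # the uniform kernel: (1/2) 1(|u| <= 1)
    return 0.5 * (np.abs(u) <= 1)

def gaussian_kernel(u):
    # the standard normal density
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(grid, data, h, K=gaussian_kernel):
    # f_hat(x) = (1/(n h)) sum_i K((x - x_i) / h), evaluated on a grid
    u = (grid[:, None] - data[None, :]) / h
    return K(u).mean(axis=1) / h

rng = np.random.default_rng(0)
data = rng.normal(size=500)        # the sample x_1, ..., x_n
grid = np.linspace(-3, 3, 61)
f_hat = kde(grid, data, h=0.3)     # h = 0.3 is an arbitrary small bandwidth
print(f_hat[30])                   # estimate at x = 0; true value is about 0.399
```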
• Another motivation is to estimate the distribution function $F(x)$ by
$$
\hat F(x) = \frac{1}{n} \sum_{i=1}^{n} 1(x_i \le x),
$$
but you cannot differentiate it to get the density.
• Replace $1(x_i \le x)$ by $G\!\left(\frac{x - x_i}{h}\right)$, where $G(\cdot)$ is any smooth distribution function ($G(\infty) = 1$, $G(-\infty) = 0$), and let $h \to 0$.
• In practice, take h as some small but fixed number, like 0.1.
• So let $K(\cdot) = G'(\cdot)$ and differentiate $\hat F(x)$ to get
$$
\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right), \quad \text{or} \quad \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right) \text{ if } x \in \mathbb{R}^d
$$
(for a symmetric kernel the sign of the argument is immaterial, so this matches the earlier formula).
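A quick numerical sketch of this motivation (my own illustration, with $G = \Phi$, the standard normal CDF, so $K = \varphi$): the derivative of the smoothed empirical CDF is exactly the kernel density estimate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(size=300)
h = 0.1   # a small fixed bandwidth, as on the slide

def F_hat(x):
    # smoothed empirical CDF: (1/n) sum_i G((x - x_i) / h), with G = Phi
    return norm.cdf((x - data) / h).mean()

def f_hat(x):
    # its exact derivative: (1/(n h)) sum_i K((x - x_i) / h), with K = G'
    return norm.pdf((x - data) / h).mean() / h

# a numerical derivative of F_hat agrees with f_hat
x0, eps = 0.5, 1e-5
print((F_hat(x0 + eps) - F_hat(x0 - eps)) / (2 * eps), f_hat(x0))
```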
Conditional Expectation: Motivation
• Estimate $E(y|x)$, or more generally $E(g(y)|x)$ for some function $g(\cdot)$, or things like conditional quantiles.
• Local weighting: use observations $x_i$ close to $x$.
• Take a neighborhood $\mathcal{N}$ around $x$; the size of $\mathcal{N}$ should shrink to 0, but not too fast.
• Average over those $y_i$ for which $x_i \in \mathcal{N}$.
• More generally, give more weight to those $y_i$ whose $x_i$ is close to $x$, and less weight to those $y_i$ whose $x_i$ is far away from $x$ (a minimal sketch follows this list).
• For weights $W_n(x, x_i)$ such that
(1) $\sum_{i=1}^{n} W_n(x, x_i) = 1$, (2) $W_n(x, x_i) \to 0$ if $x_i \ne x$, (3) $\max_{1 \le i \le n} |W_n(x, x_i)| \to 0$ as $n \to \infty$,
estimate $E(y|x)$ by $\sum_{i=1}^{n} W_n(x, x_i) Y_i$.
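The simplest weights satisfying (1)-(3) put equal mass on a shrinking neighborhood; a minimal sketch (my illustration, with simulated data):

```python
import numpy as np

def local_average(x0, x, y, h):
    # equal weights W_n(x0, x_i) = 1/#N on the neighborhood |x_i - x0| <= h,
    # zero outside; the weights sum to 1 as required
    in_window = np.abs(x - x0) <= h
    return y[in_window].mean() if in_window.any() else np.nan

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=1000)
y = np.sin(x) + 0.3 * rng.normal(size=1000)   # E(y|x) = sin(x)
print(local_average(0.5, x, y, h=0.2), np.sin(0.5))
```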
Classification
• Anything you do parametrically becomes “nonparametric” if you do it only for $x_i$ close to $x$.
• Local nonparametric estimates:
• kernel smoothing
• k-nearest neighbors (k-NN)
• local polynomials
• Global nonparametric estimates:
• series (sieve)
• splines
• The focus today is kernel smoothing.
Kernel Smoothing
• Use density weighting for the weights $W_n(x, x_i)$ to get the kernel estimator of $E(y|x)$.
• If $x_i$ is one-dimensional, let
$$
W_n(x, x_i) = \frac{\frac{1}{nh}\, K\!\left(\frac{x - x_i}{h}\right)}{\frac{1}{nh} \sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right)}, \quad \text{satisfying} \quad \sum_{i=1}^{n} W_n(x, x_i) = 1.
$$
• The kernel estimator of $E(y|x)$ is
$$
\sum_{i=1}^{n} W_n(x, x_i)\, Y_i = \frac{\frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) Y_i}{\frac{1}{nh} \sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right)}.
$$
• If $x_i \in \mathbb{R}^d$, use a multidimensional density function and replace $h$ with $h^d$.
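This ratio is the Nadaraya-Watson estimator; a minimal sketch (my illustration; the common $\frac{1}{nh}$ factors cancel, so unnormalized kernel weights suffice):

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    # kernel-weighted average: sum_i W_n(x0, x_i) * y_i, with
    # W_n(x0, x_i) = K((x0 - x_i)/h) / sum_j K((x0 - x_j)/h)
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)  # Gaussian kernel, constants cancel
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=500)
y = x**2 + 0.1 * rng.normal(size=500)   # E(y|x) = x^2
print(nadaraya_watson(0.5, x, y, h=0.05), 0.25)
```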
Another View of Kernel Estimator
• Estimate $\gamma(x)$ and $f(x)$ separately in
$$
E(y|x) = \frac{E(y|x)\, f(x)}{f(x)} = \frac{\int y f(y, x)\, dy}{f(x)} = \frac{\gamma(x)}{f(x)}.
$$
• $\hat f(x) = \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$.
• For $\hat\gamma(x)$, plug
$$
\hat f(x, y) = \frac{1}{nh^{d+1}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \bar K\!\left(\frac{y_i - y}{h}\right)
$$
into $\int y f(y, x)\, dy$, and substitute $u = (y - y_i)/h$:
$$
\int y \hat f(y, x)\, dy = \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \int y\, \frac{1}{h}\, \bar K\!\left(\frac{y_i - y}{h}\right) dy
= \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \int (y_i + uh)\, \bar K(u)\, du
= \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) y_i,
$$
using $\int \bar K(u)\, du = 1$ and $\int u\, \bar K(u)\, du = 0$ for a symmetric $\bar K$.
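A numeric sanity check of the last step, $\int (y_i + uh)\,\bar K(u)\,du = y_i$ (my own verification, with a Gaussian $\bar K$ and arbitrary $y_i$, $h$):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

y_i, h = 2.7, 0.3   # arbitrary values
# integrate (y_i + u*h) * K_bar(u) over u; the result is y_i, since K_bar
# integrates to 1 and, being symmetric, has mean zero
val, _ = quad(lambda u: (y_i + u * h) * norm.pdf(u), -np.inf, np.inf)
print(val)   # approximately 2.7
```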
• Another view of $\hat\gamma(x)$: think of $\int y \hat f(y, x)\, dy$ as $\int y\, dP$, where $P$ is the measure over $y$ defined by
$$
P(y_i \le y, x_i = x) = \frac{d}{dx}\, P(y_i \le y, x_i \le x)
\overset{\text{estimate}}{=} \frac{d}{dx}\, \frac{1}{n} \sum_{i=1}^{n} 1(y_i \le y)\, G\!\left(\frac{x_i - x}{h}\right)
= \frac{1}{nh^d} \sum_{i=1}^{n} 1(y_i \le y)\, K\!\left(\frac{x_i - x}{h}\right)
$$
• Plug this estimate of $P$ into $\int y\, dP$:
$$
\int y\, d\hat P = \int y\, d\!\left[\frac{1}{nh^d} \sum_{i=1}^{n} 1(y_i \le y)\, K\!\left(\frac{x_i - x}{h}\right)\right]
= \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right) \int y\, d1(y_i \le y)
= \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right) y_i
$$
Note
• Only need to study $\hat\gamma(x)$, since $\hat f(x)$ is just the special case of $\hat\gamma(x)$ with $y_i \equiv 1$.
• Convenient forms of the kernel (density) function:
• Uniform kernel: $\frac{1}{2}\, 1(|u| \le 1)$;
• Triangular kernel: $(1 - |u|)\, 1(|u| \le 1)$;
• Quartic, Epanechnikov, Gaussian, etc.
• Estimating derivatives: as long as the kernel is smoothly differentiable, simply differentiate $\hat\gamma(x)$ (up to the factor $(-1)^k$ from the chain rule):
$$
\hat\gamma^{(k)}(x) = \frac{1}{nh^{k+d}} \sum_{i=1}^{n} K^{(k)}\!\left(\frac{x_i - x}{h}\right) y_i
$$
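A small sketch of derivative estimation for $d = 1$, $k = 1$ (my illustration with the Gaussian kernel, whose derivative is $K'(u) = -u\,K(u)$):

```python
import numpy as np

def density_derivative(x0, data, h):
    # f_hat'(x0) = (1/(n h^2)) sum_i K'((x0 - x_i)/h), with Gaussian K
    u = (x0 - data) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return np.mean(-u * K) / h**2

rng = np.random.default_rng(4)
data = rng.normal(size=2000)
x0 = 1.0
# true f'(x0) for the standard normal: -x0 * phi(x0)
true = -x0 * np.exp(-0.5 * x0**2) / np.sqrt(2 * np.pi)
print(density_derivative(x0, data, h=0.3), true)
```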
k-NN and Local Polynomials
• The other two major weighting schemes for $W_{ni}(x)$.
• k-nearest neighbors (k-NN)
• Use the $k$ closest neighbors of the point $x$ instead of a fixed window.
• Weight these $k$ neighbors equally or according to their distances, for example with any kernel density weight $K(\cdot)$.
• Local polynomial
• Run a $k$th-order polynomial regression using the observations with $|x_i - x| \le h$ (see the sketch after this list).
• The degree $k$ corresponds to the order of the kernel.
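A minimal local linear ($k = 1$) sketch with a uniform window (my illustration; the intercept of the local fit estimates $E(y|x_0)$):

```python
import numpy as np

def local_polynomial(x0, x, y, h, k=1):
    # fit a k-th order polynomial in (x - x0) by least squares,
    # using only observations with |x_i - x0| <= h;
    # the intercept estimates E(y|x0), the slope its derivative, etc.
    mask = np.abs(x - x0) <= h
    X = np.vander(x[mask] - x0, N=k + 1, increasing=True)  # columns 1, (x-x0), ...
    coef, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    return coef[0]

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=1000)
y = np.exp(x) + 0.2 * rng.normal(size=1000)   # E(y|x) = exp(x)
print(local_polynomial(0.3, x, y, h=0.15), np.exp(0.3))
```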
Series and Splines
• Series (sieve)
• The only difference between series and local polynomials is that you run the polynomial regression using all observations, instead of only a shrinking neighborhood $(x - h, x + h)$ (see the sketch below).
• Instead of fixing $k$, let $k \to \infty$.
• Instead of polynomials, you can use a family of orthogonal series of functions, like trigonometric functions, etc.
• Splines
• Find a twice differentiable function $g(x)$ that minimizes $\sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int g''(x)^2\, dx$, for some $\lambda > 0$.
• The term $\lambda \int g''(x)^2\, dx$ penalizes the roughness of the estimate $g$.
• The minimizer is a piecewise cubic polynomial (a cubic spline) with continuous second derivatives.
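A minimal sketch of the series idea (my illustration; the degree rule $k \approx n^{1/3}$ is an arbitrary placeholder for the required $k \to \infty$):

```python
import numpy as np

def series_estimator(x0, x, y, k):
    # regress y on the polynomial basis 1, x, ..., x^k using ALL observations,
    # then evaluate the fitted function at x0
    X = np.vander(x, N=k + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return (np.vander(np.atleast_1d(x0), N=k + 1, increasing=True) @ coef)[0]

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, size=2000)
y = np.sin(3 * x) + 0.2 * rng.normal(size=2000)   # E(y|x) = sin(3x)
k = int(len(x) ** (1 / 3))     # let k grow slowly with n
print(series_estimator(0.5, x, y, k), np.sin(1.5))
```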
Optimal Rate of Convergence for Nonparametric Estimates
• Curse of dimensionality: for a given bandwidth (window size), the higher the dimension of $x$, the less data there is in a neighborhood of width $h$.
• If both $h \to 0$ and $nh^d \to \infty$, then the estimate is consistent.
• How about the speed at which the estimator converges?
• Conclusion: suppose the true function $\gamma(x)$ is $p$ times differentiable, with all $p$th derivatives bounded uniformly over $x$. Then the optimal bandwidth is $h_{opt} = n^{-\frac{1}{2p+d}}$, and the best rate at which $\hat\gamma(x)$ can approach $\gamma(x)$ is $O_p\!\left(n^{-\frac{p}{2p+d}}\right)$.
• The problem here is the bias-variance trade-off.
• The smaller the $h$, the smaller the bias, but the fewer observations you have, and thus the larger the variance.
• Criterion: total error = bias + estimation error, or MSE.
• The bias is $O_p(h^p)$.
• Use the $p$ bounded derivatives condition and a Taylor expansion.
• The variation is $O_p\!\left(\frac{1}{\sqrt{nh^d}}\right)$.
• Think of $\bar x - \mu = O_p\!\left(\frac{1}{\sqrt{n}}\right)$, with $nh^d$ playing the role of $n$.
• The total error is $O_p\!\left(h^p + \frac{1}{\sqrt{nh^d}}\right)$.
• Find an $h$ to minimize the total error:
$$
h_{opt} = O\!\left(n^{-\frac{1}{2p+d}}\right).
$$
• Then the (pointwise) optimal rate of convergence is
$$
O\!\left(h_{opt}^{p}\right) = O\!\left(\frac{1}{\sqrt{nh_{opt}^d}}\right) = O\!\left(n^{-\frac{p}{2p+d}}\right).
$$
• It is not possible to have $\sqrt{n}$ convergence for nonparametric estimates, since $\frac{p}{2p+d} < \frac{1}{2}$.
• Sometimes an $n^{1/4}$ rate of convergence is needed to get rid of the second-order terms in semiparametric estimators, which means $p > d/2$.
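A quick numerical check of this tradeoff (my own illustration): minimize $h^p + (nh^d)^{-1/2}$ over $h$ and compare the minimizer with $n^{-1/(2p+d)}$; the ratio settles to a constant.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def total_error(h, n, p, d):
    # bias of order h^p plus noise of order 1/sqrt(n h^d), constants set to 1
    return h**p + 1.0 / np.sqrt(n * h**d)

p, d = 2, 1
for n in [10**3, 10**5, 10**7]:
    res = minimize_scalar(total_error, bounds=(1e-8, 1.0),
                          method="bounded", args=(n, p, d))
    print(n, res.x / n ** (-1 / (2 * p + d)))   # ratio settles to a constant
```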
Optimal Rate for Derivative Estimates
• The optimal bandwidth for $\hat\gamma^{(k)}(x)$ is of the same order as that for estimating $\gamma(x)$ itself.
• The bias is $O_p(h^{p-k})$, and the variation is $O_p\!\left(\frac{1}{h^k\sqrt{nh^d}}\right)$.
• The total error is $O_p\!\left(h^{p-k} + \frac{1}{h^k\sqrt{nh^d}}\right)$.
• Find an $h$ to minimize this again:
$$
h_{opt} = n^{-\frac{1}{2p+d}}.
$$
• Then the best convergence rate is
$$
O_p\!\left(h_{opt}^{p-k}\right) = O_p\!\left(\frac{1}{h_{opt}^k\sqrt{nh_{opt}^d}}\right) = O_p\!\left(n^{-\frac{p-k}{2p+d}}\right).
$$
Higher Order Kernels
• A kernel of order $r$ is a $K(\cdot)$ for which
$$
\int K(u)\, du = 1, \qquad \int K(u)\, u^q\, du = 0 \ \ \forall q = 1, \ldots, r - 1, \qquad \int |u^r K(u)|\, du < \infty.
$$
• Bias of kernel estimates $= E\hat\gamma(x) - \gamma(x)$:
$$
E\hat\gamma(x) = E\, \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) Y_i
= \int \frac{1}{h^d}\, K\!\left(\frac{x - x_i}{h}\right) E(y_i|x_i)\, f(x_i)\, dx_i
= \int \frac{1}{h^d}\, K\!\left(\frac{x - x_i}{h}\right) \gamma(x_i)\, dx_i
= \int K(u)\, \gamma(x + uh)\, du
$$
$$
= \gamma(x) + \sum_{j=1}^{r-1} \frac{h^j \gamma^{(j)}(x)}{j!} \int u^j K(u)\, du + h^r\, \frac{1}{r!} \int \gamma^{(r)}(x^*)\, u^r K(u)\, du
$$
• If $\gamma(x)$ has $p$ bounded derivatives and the kernel is of order $r$, then the bias is $O\!\left(h^{\min(p,r)}\right)$.
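For illustration (my example, not from the slides): a standard fourth-order kernel built from the Gaussian density is $K_4(u) = \frac{1}{2}(3 - u^2)\varphi(u)$; the sketch below verifies the defining moment conditions numerically.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def K4(u):
    # Gaussian-based kernel of order r = 4: integrates to 1,
    # and its first three moments vanish
    return 0.5 * (3 - u**2) * norm.pdf(u)

for q in range(4):
    val, _ = quad(lambda u: u**q * K4(u), -np.inf, np.inf)
    print(q, round(val, 10))   # prints 1, 0, 0, 0
```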
• Variance of kernel estimates:
$$
\operatorname{Var}(\hat\gamma(x)) = \frac{1}{n^2h^{2d}} \sum_{i=1}^{n} \operatorname{Var}\!\left(K\!\left(\frac{x - x_i}{h}\right) Y_i\right)
= \frac{1}{nh^{2d}}\, E\!\left(K^2\!\left(\frac{x - x_i}{h}\right) Y_i^2\right) - \frac{1}{nh^{2d}} \left(E\, K\!\left(\frac{x - x_i}{h}\right) Y_i\right)^2
$$
$$
= \frac{1}{nh^d} \int \frac{1}{h^d}\, K^2\!\left(\frac{x - x_i}{h}\right) E(y_i^2|x_i)\, f(x_i)\, dx_i - \frac{1}{n} \left(E\, \frac{1}{h^d}\, K\!\left(\frac{x - x_i}{h}\right) Y_i\right)^2
$$
$$
= \frac{1}{nh^d} \int \frac{1}{h^d}\, K^2\!\left(\frac{x - x_i}{h}\right) g(x_i)\, dx_i + O\!\left(\frac{1}{n}\right)
= \frac{1}{nh^d} \int K^2(u)\, g(x + uh)\, du + O\!\left(\frac{1}{n}\right)
$$
$$
= \frac{1}{nh^d} \int K^2(u)\, du\; g(x) + \frac{h}{nh^d} \int K^2(u)\, g'(x^*)\, u\, du + O\!\left(\frac{1}{n}\right)
= \frac{1}{nh^d} \int K^2(u)\, du\; g(x) + O\!\left(\frac{h}{nh^d}\right) + O\!\left(\frac{1}{n}\right) = O\!\left(\frac{1}{nh^d}\right),
$$
where $g(x) = E(y^2|x)\, f(x)$.
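A Monte Carlo check of the $O\!\left(\frac{1}{nh^d}\right)$ variance (my illustration for the density estimate, $y_i \equiv 1$, $d = 1$, Gaussian kernel, so $\int K^2(u)\,du = \frac{1}{2\sqrt{\pi}}$):

```python
import numpy as np

rng = np.random.default_rng(7)
x0, h, n, reps = 0.0, 0.2, 500, 2000
est = np.empty(reps)
for r in range(reps):
    data = rng.normal(size=n)
    u = (x0 - data) / h
    est[r] = np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

# leading term of the theory: Var ~ f(x0) * Int(K^2) / (n h)
f0 = 1 / np.sqrt(2 * np.pi)                       # standard normal density at 0
print(est.var(), f0 / (2 * np.sqrt(np.pi)) / (n * h))   # approximately equal
```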
Asymptotic Distribution, Confidence Band
• If we use $h \sim h_{opt}$, the asymptotic distribution will depend on both the bias and the variance.
• If we use $h \ll h_{opt}$, i.e., $\frac{h}{h_{opt}} \to 0$, the asymptotic distribution has no bias in it, but the convergence rate is not the fastest.
• Example: consider $d = 1$, $r = 2$ (so effectively $p = 2$); then $h_{opt} = n^{-\frac{1}{2p+d}} = n^{-\frac{1}{5}}$.
• Find the asymptotic distribution of
$$
\sqrt{nh_{opt}}\, (\hat m(x) - m(x)) = h_{opt}^{-2}\, (\hat m(x) - m(x)), \qquad \text{for } \hat m(x) = \frac{\hat\gamma(x)}{\hat f(x)}.
$$
Bias
• Linearization:
$$
\hat m(x) - m(x) \approx \frac{1}{f(x)}\, (\hat\gamma(x) - \gamma(x)) - \frac{\gamma(x)}{f(x)^2} \left(\hat f(x) - f(x)\right)
$$
• As seen above, $E\hat\gamma(x) - \gamma(x) = \frac{1}{2} h^2 \gamma''(x) \int u^2 K(u)\, du$.
• $E\hat f(x) - f(x) = \frac{1}{2} h^2 f''(x) \int u^2 K(u)\, du$, since $\gamma(x) = m(x) f(x)$ with $m(x) \equiv 1$ for the density estimate.
• Therefore,
$$
E\, h_{opt}^{-2}\, (\hat m(x) - m(x)) = \frac{1}{2} \left(\frac{\gamma''}{f} - \frac{m}{f}\, f''\right) \int u^2 K(u)\, du
= \frac{1}{2} \left(\frac{m''f + 2m'f' + mf''}{f} - \frac{m}{f}\, f''\right) \int u^2 K(u)\, du
= \frac{2m'(x) f'(x) + m''(x) f(x)}{2 f(x)} \int u^2 K(u)\, du.
$$
Variance
• As seen above, for $g(x) = E(y^2|x)\, f(x)$,
$$
\operatorname{Var}\!\left(\sqrt{nh}\, (\hat\gamma(x) - \gamma(x))\right) \to g(x) \int K^2(u)\, du.
$$
• $\operatorname{Var}\!\left(\sqrt{nh}\, (\hat f(x) - f(x))\right) \to f(x) \int K^2(u)\, du$, since for the density estimate, where $y \equiv 1$, $g(x) = f(x)$.
• The covariance between $\hat\gamma(x)$ and $\hat f(x)$:
$$
\operatorname{Cov}\!\left(\sqrt{nh}\, (\hat\gamma(x) - \gamma(x)),\ \sqrt{nh}\, (\hat f(x) - f(x))\right) \to \gamma(x) \int K^2(u)\, du.
$$
• Therefore, by the delta method,
$$
\operatorname{Var}\!\left(\sqrt{nh}\, (\hat m(x) - m(x))\right) = \operatorname{Var}\!\left(\sqrt{nh} \left(\frac{1}{f}\, \hat\gamma - \frac{m}{f}\, \hat f\right)\right)
= \left(\frac{1}{f^2}\, E(y^2|x)\, f - \frac{2}{f^2}\, m \gamma + \frac{m^2}{f^2}\, f\right) \int K^2(u)\, du
$$
$$
= \frac{1}{f(x)} \left(E(y^2|x) - m(x)^2\right) \int K^2(u)\, du = \frac{\sigma^2(x)}{f(x)} \int K^2(u)\, du
$$
• To summarize: with $h = h_{opt} = n^{-1/5}$,
$$
\sqrt{nh}\, (\hat m(x) - m(x)) \xrightarrow{d} N\!\left(\frac{m''(x) f(x) + 2m'(x) f'(x)}{2 f(x)} \int u^2 K(u)\, du,\ \ \frac{\sigma^2(x)}{f(x)} \int K^2(u)\, du\right)
$$
• If we use an undersmoothed bandwidth $h \ll n^{-1/5}$, say $h = n^{-1/4}$, then
$$
\sqrt{nh}\, (\hat m(x) - m(x)) \xrightarrow{d} N\!\left(0,\ \frac{\sigma^2(x)}{f(x)} \int K^2(u)\, du\right)
$$
• If we use $h_{opt}$ to draw the confidence interval around $\hat m(x)$, a consistent estimate of the bias term is needed.
• However, $\gamma''(x)$ can NOT be estimated consistently using $h_{opt}$ (with $p = 2$ and $k = 2$, the derivative rate $n^{-\frac{p-k}{2p+d}}$ above is $n^0$). Instead, use an oversmoothed bandwidth, say $g = n^{-1/6}$.
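A sketch of a pointwise confidence interval using undersmoothing (my illustration; $\hat f$ and $\hat\sigma^2$ are simple plug-in estimates, and the data are simulated):

```python
import numpy as np

def nw_confidence_interval(x0, x, y, z=1.96):
    n = len(x)
    h = n ** (-0.25)              # undersmoothing: h = n^(-1/4) << n^(-1/5)
    u = (x0 - x) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    m_hat = np.sum(K * y) / np.sum(K)                  # Nadaraya-Watson estimate
    f_hat = K.mean() / h                               # density estimate at x0
    s2_hat = np.sum(K * (y - m_hat) ** 2) / np.sum(K)  # plug-in sigma^2(x0)
    RK = 1 / (2 * np.sqrt(np.pi))                      # Int(K^2) for the Gaussian kernel
    se = np.sqrt(s2_hat * RK / (f_hat * n * h))
    return m_hat - z * se, m_hat + z * se

rng = np.random.default_rng(8)
x = rng.uniform(-1, 1, size=5000)
y = x**3 + 0.3 * rng.normal(size=5000)       # E(y|x) = x^3
print(nw_confidence_interval(0.2, x, y))     # should cover 0.008
```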
Automatic Bandwidth Selection
• Good fit of the estimate:
• Minimize $\sum_{i=1}^{n} (\hat m(x_i) - m(x_i))^2$.
• If we replace $m(x_i)$ with $y_i$, we get a perfect fit of 0, since as $h \to 0$, $\hat m(x_i) = y_i$.
• Another way to think about this:
$$
\sum_{i=1}^{n} (\hat m(x_i) - y_i)^2 = \sum_{i=1}^{n} (\hat m(x_i) - m(x_i) - \varepsilon_i)^2
= \underbrace{\sum_{i=1}^{n} (\hat m(x_i) - m(x_i))^2}_{\text{what we want}} + \underbrace{\sum_{i=1}^{n} \varepsilon_i^2}_{\text{unrelated}} - \underbrace{2 \sum_{i=1}^{n} (\hat m(x_i) - m(x_i))\, \varepsilon_i}_{\text{the trouble}}.
$$
• Expectation of the trouble term (only the $j = i$ terms survive, since $E\varepsilon_i\varepsilon_j = 0$ for $j \ne i$):
$$
E \sum_{i=1}^{n} \frac{1}{nh} \sum_{j=1}^{n} K\!\left(\frac{x_i - x_j}{h}\right) \varepsilon_j \varepsilon_i = \frac{1}{nh} \sum_{i=1}^{n} K(0)\, \sigma^2 = \frac{1}{h}\, \sigma^2 K(0)
$$
• Cross validation
• Leave-one-out estimate: $\hat m_{-i}(x_i) = \frac{1}{(n-1)h} \sum_{j \ne i} K\!\left(\frac{x_j - x_i}{h}\right) y_j$
• Minimize the cross-validation function (see the sketch below):
$$
CV(h) = \sum_{i=1}^{n} (\hat m_{-i}(x_i) - y_i)^2
$$
• Penalizing function
• A consistent estimate of the trouble term is $K(0)\, \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat m(x_i))^2$.
• Minimize the penalized criterion
$$
G(h) = \sum_{i=1}^{n} (\hat m(x_i) - y_i)^2 + 2 K(0)\, \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat m(x_i))^2
$$
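A minimal leave-one-out cross-validation sketch for the Nadaraya-Watson bandwidth (my illustration; zeroing the diagonal of the kernel matrix implements the leave-one-out fit):

```python
import numpy as np

def loo_cv(h, x, y):
    # CV(h) = sum_i (m_hat_{-i}(x_i) - y_i)^2 with Nadaraya-Watson weights
    u = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2)
    np.fill_diagonal(K, 0.0)            # leave observation i out of its own fit
    m_loo = K @ y / K.sum(axis=1)
    return np.sum((m_loo - y) ** 2)

rng = np.random.default_rng(9)
x = rng.uniform(0, 1, size=400)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=400)
grid = np.geomspace(0.005, 0.5, 30)     # candidate bandwidths
h_cv = grid[np.argmin([loo_cv(h, x, y) for h in grid])]
print(h_cv)
```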
Bias Reduction by Jackknifing
• It is essentially equivalent to a higher-order kernel.
• It makes no difference if you are just running a simple kernel regression.
• If the objective function is convex only with a positive $K(\cdot)$, say when running a nonparametric quantile regression, then operationally the jackknife method is very useful for preserving the convexity of the objective function.
Uniform rate of convergence
• It is useful to obtain the optimal bandwidth and the optimal uniform convergence rate, i.e., for $\sup_{x \in \mathcal{X}} |\hat\gamma(x) - \gamma(x)|$.
• Again, consider the bias-variance tradeoff.
• The bias $\sup_{x \in \mathcal{X}} |E\hat\gamma(x) - \gamma(x)|$ for an $r$th-order kernel is $O_P(h^p)$.
• The error $\sup_{x \in \mathcal{X}} |\hat\gamma(x) - E\hat\gamma(x)|$ is $O_p\!\left(\left(\frac{nh^d}{\log n}\right)^{-1/2}\right)$.
• Use the Bernstein inequality in the proof.
• Minimize the total error $O_P(h^p) + O_p\!\left(\left(\frac{nh^d}{\log n}\right)^{-1/2}\right)$: this gives $h_{opt} = O\!\left(\left(\frac{\log n}{n}\right)^{\frac{1}{2p+d}}\right)$ and the uniform rate $O_p\!\left(\left(\frac{\log n}{n}\right)^{\frac{p}{2p+d}}\right)$.