Lecture 9: Nonparametric Regression (1)


Page 1: Lecture 9: Nonparametric Regression (1)

Outline: Introduction · Estimations: local modelling · Cross Validation · Assignments

Applied Statistics 2015

Page 2: An example: Pick-It Lottery

The New Jersey Pick-It Lottery is a daily numbers game run by the state of New Jersey. Buying a ticket entitles a player to pick a number between 0 and 999. Half of the money bet each day goes into the prize pool. (The state takes the other half.) The state picks a winning number at random, and the prize pool is shared equally among all winning tickets.

We analyze the first 254 drawings after the lottery started in 1975. Figure 1 shows a scatterplot of the winning numbers and their payoffs.

Page 3: An example: Pick-It Lottery

[Figure 1: scatterplot of the winning numbers (Number) against their payoffs (Payoff) for the 254 drawings.]

Although all numbers are equally likely to win, numbers chosen by fewer people have bigger payoffs if they win, because the prize is shared among fewer tickets.

Question: can we find some pattern in the data? Are there numbers with larger payoffs?

Page 4: An example: Pick-It Lottery

The question can be answered by regression analysis. Linear regression assumes a linear relation between payoff and winning number. The blue dashed line is the least squares regression line, which shows a general trend of higher payoffs for larger winning numbers.

[Figure: Payoff vs. Number with the least squares regression line (blue dashed).]

Page 5: Nonparametric regression

Nonparametric regression does not assume any parametric structure. It is also known as "learning a function" in the field of machine learning. There are n pairs of observations (x_1, Y_1), …, (x_n, Y_n). The response variable Y is related to the covariate x by the equation

$$Y_i = r(x_i) + \varepsilon_i, \qquad i = 1, \dots, n,$$

where r is the regression function, E(ε_i) = 0 and Var(ε_i) = σ².

Here we want to estimate r under weak assumptions, without assuming a parametric model for r.

We are treating the covariates x_i as fixed (a fixed design). For a random design, the data are (X_i, Y_i), i = 1, …, n, and r(x) is the conditional expectation of Y given X = x: r(x) = E(Y | X = x).

Page 6: A general idea behind the different estimators

Note that Y_i is the sum of r(x_i) and an error whose expected value is zero. This motivates estimating r(x) by the average of those Y_i for which x_i is "close" to x.

Different ways of averaging and different measures of closeness lead to different estimators.

Page 7: An Example

The data are n = 60 pairs of observations from a certain regression model.

How to construct r_n, an estimator of r?

[Figure: scatterplot of the n = 60 observations (x on [0, 1] against Y).]

Page 8: Estimator: Regressogram

A regressogram is constructed in a similar manner as a histogram. Here we consider x_i ∈ [0, 1]. Divide the unit interval into m equally spaced bins denoted by B_1, B_2, …, B_m. Define the regressogram

$$g_n(x) = \frac{1}{k_j} \sum_{i:\, x_i \in B_j} Y_i, \qquad \text{for } x \in B_j,$$

where k_j is the number of points in B_j. Here we use the convention 0/0 = 0.
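The definition translates directly into code. Below is a minimal sketch in Python with NumPy; the function name `regressogram` and its interface are illustrative, not from the lecture, and it assumes x_i ∈ [0, 1] as above.

```python
import numpy as np

def regressogram(x, y, m=10):
    """Average Y over m equally spaced bins of [0, 1]."""
    edges = np.linspace(0.0, 1.0, m + 1)
    means = np.zeros(m)
    for j in range(m):
        # Bin B_j = (edges[j], edges[j+1]]; the first bin also includes 0.
        in_bin = (x > edges[j]) & (x <= edges[j + 1]) if j > 0 else (x <= edges[1])
        # Convention 0/0 = 0 for empty bins.
        means[j] = y[in_bin].mean() if in_bin.any() else 0.0

    def g_n(x0):
        # Find the bin of each query point and return the bin mean.
        j = np.clip(np.searchsorted(edges, x0, side="left") - 1, 0, m - 1)
        return means[j]

    return g_n
```

Evaluating g_n on a grid reproduces the step function shown in the figure below.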

[Figure: Regressogram fit with m = 10.]


Page 10: Estimator: Local average

Fix h > 0,

$$r_n(x) = \frac{\sum_{i=1}^n I(x - h < x_i \le x + h)\, Y_i}{\sum_{i=1}^n I(x - h < x_i \le x + h)}.$$

This is also called the naive kernel estimator:

$$r_n(x) = \frac{\sum_{i=1}^n \frac{1}{2}\, 1_{[-1,1)}\big(\frac{x - x_i}{h}\big)\, Y_i}{\sum_{i=1}^n \frac{1}{2}\, 1_{[-1,1)}\big(\frac{x - x_i}{h}\big)}.$$
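A direct implementation of the local average (again a sketch with NumPy; how to handle an empty window is a design choice the slide leaves open):

```python
import numpy as np

def local_average(x, y, h):
    """Box-kernel (naive) estimator: average the Y_i with x0 - h < x_i <= x0 + h."""
    def r_n(x0):
        w = (x > x0 - h) & (x <= x0 + h)           # I(x0 - h < x_i <= x0 + h)
        return y[w].mean() if w.any() else np.nan  # undefined on an empty window
    return r_n
```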

[Figure: Local average fit with h = 0.2.]


Page 12: Nadaraya-Watson Estimator

Replacing the box kernel by a general kernel in the local average estimator, we obtain the Nadaraya-Watson estimator of r:

$$r_n(x) = \frac{\sum_{i=1}^n K\big(\frac{x - x_i}{h}\big)\, Y_i}{\sum_{i=1}^n K\big(\frac{x - x_i}{h}\big)}.$$
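A sketch of the Nadaraya-Watson estimator with a Gaussian kernel (the kernel need not be normalized, since constants cancel between numerator and denominator; the function name is illustrative):

```python
import numpy as np

def nadaraya_watson(x, y, h, kernel=lambda u: np.exp(-0.5 * u**2)):
    """Kernel-weighted local average of the Y_i around x0."""
    def r_n(x0):
        w = kernel((x0 - x) / h)          # weight of each observation
        return np.sum(w * y) / np.sum(w)  # weighted average
    return r_n
```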

[Figure: Nadaraya-Watson fit with h = 0.2 and a Gaussian kernel.]


Page 14: The true regression function

The black curve indicates r(x), the real regression function. The underlying model is Y_i = sin(8x_i) − x_i + x_i³ + ε_i, with ε_i ∼ N(0, 0.5).

[Figure: the data with the true regression function r(x) overlaid (black curve).]
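Data from this model can be simulated as follows (a sketch: the seed, the uniform design, and the reading of N(0, 0.5) as variance 0.5 are my assumptions, not stated on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)               # seed fixed only for reproducibility
n = 60
x = np.sort(rng.uniform(0.0, 1.0, n))        # assumed design on [0, 1]
r = lambda t: np.sin(8 * t) - t + t**3       # the true regression function
y = r(x) + rng.normal(0.0, np.sqrt(0.5), n)  # N(0, 0.5) read as variance 0.5
```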

Page 15: Some comments

The three estimators can be written in the same form:

$$r_n(x) = \sum_{i=1}^n l_i(x)\, Y_i.$$

Define the class of piecewise constant functions

$$\mathcal{F}_m = \Big\{ f : f(t) = \sum_{i=1}^m c_i\, I(t \in B_i),\ c_i \in \mathbb{R} \Big\}.$$

Then the regressogram satisfies

$$g_n = \arg\min_{f \in \mathcal{F}_m} \sum_{i=1}^n \big(Y_i - f(x_i)\big)^2.$$

The Nadaraya-Watson estimator can be considered as locally fitting a constant to the data:

$$r_n(x) = \arg\min_{c \in \mathbb{R}} \sum_{i=1}^n K\Big(\frac{x - x_i}{h}\Big) (Y_i - c)^2.$$
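Setting the derivative with respect to c to zero verifies this claim (a one-line check, not spelled out on the slide):

```latex
% d/dc of the criterion: -2 \sum_i K((x - x_i)/h)(Y_i - c) = 0, hence
c \sum_{i=1}^n K\Big(\frac{x - x_i}{h}\Big) = \sum_{i=1}^n K\Big(\frac{x - x_i}{h}\Big) Y_i
\quad\Longrightarrow\quad
c = \frac{\sum_{i=1}^n K\big(\frac{x - x_i}{h}\big) Y_i}{\sum_{i=1}^n K\big(\frac{x - x_i}{h}\big)} = r_n(x).
```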

Page 16: Risk

For fixed x,

$$\mathrm{MSE}(r_n(x)) = E\big((r_n(x) - r(x))^2\big) = \big(E(r_n(x)) - r(x)\big)^2 + \mathrm{Var}(r_n(x)).$$
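The decomposition follows by adding and subtracting E(r_n(x)) inside the square; the cross term vanishes (a standard step, written out here for completeness):

```latex
E\big((r_n(x) - r(x))^2\big)
  = E\Big(\big([r_n(x) - E(r_n(x))] + [E(r_n(x)) - r(x)]\big)^2\Big)
  = \mathrm{Var}(r_n(x)) + \big(E(r_n(x)) - r(x)\big)^2,
% since the cross term 2\,[E(r_n(x)) - r(x)]\,E\big(r_n(x) - E(r_n(x))\big) = 0.
```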

As a global index, we consider

$$\mathrm{MISE}(r_n) = E\Big(\int (r_n(x) - r(x))^2\, dx\Big) = \int \big(E(r_n(x)) - r(x)\big)^2\, dx + \int \mathrm{Var}(r_n(x))\, dx;$$

and the average mean square error

$$\mathrm{AMSE}(r_n) = \frac{1}{n} \sum_{i=1}^n E\big((r_n(x_i) - r(x_i))^2\big).$$

Page 17: Cross Validation: choosing bandwidths

Take the AMSE as the criterion. We would like to choose h to minimize

$$\mathrm{AMSE}(h) = \frac{1}{n} \sum_{i=1}^n E\big((r_{nh}(x_i) - r(x_i))^2\big).$$

Since r is unknown, we need to estimate AMSE(h). As a first guess, one might think of the average residual sum of squares

$$\frac{1}{n} \sum_{i=1}^n \big(Y_i - r_{nh}(x_i)\big)^2.$$

This turns out to be a bad choice: it usually leads to undersmoothing (overfitting). The reason is that it favors estimates that are too well adapted to the data at hand and not reasonable for new observations. Indeed, as h → 0 the fit interpolates the data, so the average residual sum of squares can be driven towards zero no matter how wiggly the estimate becomes.

Page 18: Cross Validation: choosing bandwidths

We estimate the risk using the leave-one-out cross validation score, defined as

$$CV(h) = \frac{1}{n} \sum_{i=1}^n \big(Y_i - r^{(i)}_{nh}(x_i)\big)^2,$$

where r^{(i)}_{nh}(x_i) is the estimator based on {(x_j, Y_j), 1 ≤ j ≤ n, j ≠ i}, i.e. omitting the observation (x_i, Y_i).

Page 19: Cross Validation: choosing bandwidths

In order to compute the CV score, there is no need to fit the curve n times. Let r_{nh}(x) = ∑_{i=1}^n l_i(x) Y_i. Then CV(h) can be written as

$$CV(h) = \frac{1}{n} \sum_{i=1}^n \Big(\frac{Y_i - r_{nh}(x_i)}{1 - l_i(x_i)}\Big)^2.$$

Hence

$$h_{cv} = \arg\min_h CV(h) = \arg\min_h \frac{1}{n} \sum_{i=1}^n \Big(\frac{Y_i - r_{nh}(x_i)}{1 - l_i(x_i)}\Big)^2.$$
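For the Nadaraya-Watson estimator the weights are l_i(x) = K((x − x_i)/h) / ∑_j K((x − x_j)/h), so the shortcut formula is straightforward to code. A sketch in Python/NumPy (function names and the bandwidth grid are illustrative):

```python
import numpy as np

def cv_score(x, y, h, kernel=lambda u: np.exp(-0.5 * u**2)):
    """Leave-one-out CV score for Nadaraya-Watson via the shortcut formula."""
    K = kernel((x[:, None] - x[None, :]) / h)  # K[j, i] = K((x_j - x_i)/h)
    L = K / K.sum(axis=1, keepdims=True)       # L[j, i] = l_i(x_j)
    fitted = L @ y                             # r_nh(x_j) for all j at once
    lii = np.diag(L)                           # the diagonal weights l_i(x_i)
    return np.mean(((y - fitted) / (1.0 - lii)) ** 2)

# Minimize CV(h) over a grid of candidate bandwidths:
# hs = np.linspace(0.01, 0.5, 50)
# h_cv = hs[np.argmin([cv_score(x, y, h) for h in hs])]
```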

Page 20: An example: Pick-It Lottery

h = 1, 10, 20, 50.

[Figure: four panels of Payoff vs. Number, showing the fitted curve for each bandwidth h = 1, 10, 20, 50.]

Page 21: An example: Pick-It Lottery

[Figure: Payoff vs. Number with the fitted regression curve.]

The curve suggests that there were larger payoffs for numbers in the interval [0, 100]. People tended to pick numbers starting with 2 and 3. This pattern disappeared after 1976: people noticed the pattern and changed their choices.

Page 22: Lectures 10 – 12

Page 23: Group Presentation (April 20)

Group 16

The data are the infant-mortality rates (infant deaths per 1000 live births) and GDP per capita (in U.S. dollars) for 193 countries in 2003.

Make a scatter plot of the data.

Estimate the regression function with different approaches. Give your comments.

Page 24: Group Presentation (April 20)

Group 17

Download the dataset CMB from http://www.stat.cmu.edu/~larry/all-of-nonpar/data.html. Consider power as the response variable and Multipole as the covariate. Fit a model based on the first 400 observations:

Make a scatter plot of the data. Consider the Nadaraya-Watson estimator, using the CV(h) score to choose h. Present your estimate.

Repeat the procedure above, but for the whole data set.

Page 25: Group Presentation (April 20)

Group 18

Consider the following model.

$$Y_i = r(x_i) + \varepsilon_i,$$

where r(x) = x² − 2x, x ∈ [0, 2], and the ε_i are iid N(0, 0.5²).

1. Simulate one sample {(x_i, Y_i), i = 1, …, 100}. You can choose x_i = i/50.
2. Fit a Nadaraya-Watson estimator to these data. Choose your kernel and h. Estimate MSE(r_n(1)) by simulation. Hint: simulate many samples from the model and use the sample counterpart as the estimator.
3. Repeat step 2 for a different h. Compare the results and give your comments.
