

A Note on Local Scoring and Weighted Local Polynomial Regression in Generalized Additive Models^1

J.D. Opsomer, Iowa State University^2

Goran Kauermann, University of Glasgow^3

November 14, 2002

^1 Running title: Weighted Generalized Additive Models.
^2 Department of Statistics, Ames, IA 50011, USA; email: [email protected]
^3 Department of Statistics & Robertson Centre, Boyd Orr Building, Glasgow G12 8QQ, UK.


    Abstract

This article describes the asymptotic properties of local polynomial regression estimators for univariate and additive models when observation weights are included. Such weighted additive models are a crucial component of local scoring, the widely used estimation algorithm for generalized additive models described in Hastie and Tibshirani (1990). The statistical properties of the univariate local polynomial estimator are shown to be asymptotically unaffected by the weights. In contrast, the weights inflate the asymptotic variance of the additive model estimators. The implications of these findings for the local scoring estimators are discussed.

    Key words: backfitting, additive model, GAM.


    1 Introduction

Additive models and generalized additive models are popular multivariate nonparametric regression techniques, widely used by statisticians and other scientists. While a number of different fitting methods are available for these models, the most popular

    one is backfitting, an iterative algorithm proposed by Friedman and Stuetzle (1981).

    For generalized additive models (GAM), the backfitting iteration is performed by

    transforming the original observations to scores and fitting those in an iterative and

    weighted manner using backfitting. The overall procedure is called local scoring

(Hastie and Tibshirani, 1990, p. 140), which generalizes the Fisher scoring procedure described in McCullagh and Nelder (1989, p. 42). This procedure is implemented in the gam() routine in Splus and is frequently used in practice.

    Backfitting and local scoring break the multivariate regression into a sequence

    of univariate regressions, which are much easier to compute. Unlike in unrestricted

    multi-dimensional smoothing, the resulting one-dimensional additive fits for each

    of the covariates are easily displayed and interpreted. The ease of calculation and

    interpretation has made these techniques widely used in statistical data exploration

    and analysis. Recent applications of GAM fitting in the statistical literature include

Couper and Pepe (1997), Bio et al. (1998), Figueiras and Cadarso-Suárez (2001),

    Fricker and Hengartner (2001), Rothery and Roy (2001), and a large and growing

number of researchers use GAM as an exploratory tool in their day-to-day statistical practice.

    The study of the statistical properties of these estimators is complicated by their

    implicit definition as the convergence point of an iterative algorithm. For unweighted

    backfitting estimators, equivalent explicit definitions are available and can sometimes

    be used to derive the properties of the estimators. When the univariate smoothing

    methods used in backfitting are projections onto specific subspaces (for instance,

    parametric regression, regression and smoothing splines), it is possible to write the

    overall additive model estimator as a projection. The estimator can then be studied

    using this equivalent formulation. See Stone (1985), Wahba (1986), Gu et al. (1989),

    Hardle and Hall (1993), Ansley and Kohn (1994) and Mammen et al. (1999) for

    results using this approach.

    Alternatively, it is possible to rewrite the backfitting estimator as the solution

of a large system of linear equations. Opsomer and Ruppert (1997) use this alternative definition to derive asymptotic mean squared error properties of the estimators


for bivariate additive models using local polynomial regression, a widely used

    non-projection smoother, and Opsomer (2000) generalizes these results to models of

higher dimension. In this paper, we study the asymptotic properties of the backfitting estimator for additive models in the presence of observation weights. These

    weights can be included in a regression model to account for heteroskedasticity, or

    they can reflect the sampling design used for collecting the data. As we will show,

    observation weights have an effect on both the asymptotic bias and variance, so that

    users of additive models should be aware of it.

    Another reason for studying the effect of observation weights in the context of

additive models is that these weights form an integral part of the local scoring algorithm itself. In local scoring for generalized additive models, the backfitting step is performed repeatedly using observation weights, and these weights are updated at each iteration of the algorithm. Hence, local scoring is essentially iterative fitting of

    additive models using weighted observations. The weights depend on the mean value

    of the response and are updated iteratively. An outline of the local scoring algorithm

    is provided in Figure 1, which is further explained in Section 2.

    As for unweighted backfitting, some results on local scoring are available when

    projection-type smoothers are used. Stone (1986) and Burman (1990) study these

    estimators in the context of regression splines. Gu et al. (1989) use smoothing

    splines within a Bayesian framework to find an asymptotic posterior distribution for

    the estimators. Recently, Aerts et al. (2002) describe some theoretical results for

    generalized additive models fitted with penalized regression splines.

    If non-projection smoothers such as kernel-based methods are used, it is necessary

to consider the sequence of weighted additive model fits explicitly. To our knowledge,

    that has not yet been done. The results of the current paper are a step towards the

    study of the local scoring algorithm, and can also be used by other authors interested

    in that estimator. In Kauermann and Opsomer (2002), we study the properties of

    a local likelihood estimator which is closely related to the local scoring estimator,

and we rely heavily on the weighted additive model results that are described in the current article.

    A different approach to fitting GAMs has been explored by Linton (2000), who

    replaces the backfitting component of local scoring by marginal integration, a non-

    recursive method. While promising, the results from this research are not applicable

to the much more widely used local scoring estimators and will not be further discussed here. We refer, however, to Sperlich et al. (1999) for a comparison of backfitting


    and integration estimators in additive models.

    The outline of the paper is as follows. Section 2 introduces the statistical model

    that is studied in this article and reviews the local scoring estimator. In Section

    3, we derive the asymptotic results for weighted local polynomial regression and for

    weighted additive models. Section 4 describes the implications of the previous section

    for local scoring estimators.

    2 The statistical model

In generalized additive models, the response variable $Y$ is assumed to have an exponential family distribution with conditional mean $\mu = E(Y \mid X_1, \ldots, X_D)$, which is linked to the predictors via
\[
g(\mu) = \alpha + m_1(X_1) + \ldots + m_D(X_D), \tag{1}
\]
where $g(\cdot)$ is a known, invertible function. If $g(\cdot)$ is the identity link and the errors follow a continuous density, this reduces to an additive model. Just as the additive

    model can be considered an extension of linear regression, GAM can be thought of as

a nonparametric equivalent of the generalized linear model of Nelder and Wedderburn (1972). This class of models is quite broad and includes not only the additive model itself but also nonparametric extensions to proportional-hazards, logit, log-linear regression and numerous other models.
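One illustrative special case (not one singled out in the text): for a binary response with the logit link, model (1) becomes the nonparametric analogue of logistic regression,
\[
\log\frac{\mu}{1-\mu} = \alpha + m_1(X_1) + \ldots + m_D(X_D), \qquad \mu = E(Y \mid X_1, \ldots, X_D).
\]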

    The local scoring algorithm for fitting generalized additive models combines a

    generalization of Fisher scoring, as commonly used for generalized linear models

    (McCullagh and Nelder, 1989), with backfitting, an iterative procedure that reduces

the D-dimensional fitting problem to a sequence of 1-dimensional ones (Buja et al., 1989). Since it will be referred to frequently, we outline the local scoring algorithm

in Figure 1. In this figure, $\eta_i = g(\mu_i)$ represents the (unknown) additive function of the predictors for the $i$th observation (see expression (1)), $V_i^t$ is an estimate of $\mathrm{Var}(Y_i)$ at iteration $t$ (typically, $V_i^t$ is a known function of $\mu_i^t$), and $S_1, \ldots, S_D$ are

    the smoother matrices corresponding to the nonparametric regression method used

    for fitting the model (see Section 3).

In the next section, we begin by studying the situation where $g(\cdot)$ is the identity function and the weights are a smooth function of the covariates. After introducing

    a weighted univariate smoother and deriving its basic properties, we will study the

    weighted backfitting estimator.


    3 Weighted Additive Models

    3.1 Weighted Local Polynomial Regression

    In order to perform the weighted additive model step of the local scoring algorithm,

    a weighted nonparametric smoother needs to be defined. We will focus here on local

    polynomial regression, a popular smoothing technique, and define a weighted version

for use in generalized additive models. Hastie and Tibshirani (1990, pp. 72–74) discuss

    approaches to include weights into the various nonparametric regression techniques.

    For local polynomial regression, they propose to multiply the observation weights

    and the kernel weights, and use them in the weighted least-squares fit. We will

    follow that approach here. For simplicity, we will only provide asymptotic results

for the case when the degree of the local polynomial is odd. This covers local linear smoothing, the most commonly used kernel-based regression method.

Suppose that the data $(X_i, Y_i)$, $i = 1, \ldots, n$ is generated by the following model:
\[
Y_i = m(X_i) + v(X_i)^{1/2} \varepsilon_i, \tag{2}
\]
where $m(\cdot)$ and $v(\cdot)$ are continuous, unknown functions over the support of $X_i$ and the $\varepsilon_i$ are independent and identically distributed random variables with mean 0 and variance 1. Let $\boldsymbol{X} = (X_1, \ldots, X_n)^T$ and $\boldsymbol{Y} = (Y_1, \ldots, Y_n)^T$. Let $K$ represent a kernel function and $h$ the corresponding bandwidth parameter, and let $r(X_i)$, $i = 1, \ldots, n$ represent a set of observation weights, assumed here to be a function of $X_i$. If $v$ were known, an obvious choice for the weight function would be $r(\cdot) = v(\cdot)^{-1}$, but we consider more general weight functions. The weighted local polynomial regression estimator of degree $p$ at a location $x$, written as $\hat m(x)$, is defined as the solution for $\beta_0$ to the weighted mean squared error minimization centered at $x$,
\[
\min_{\beta_0, \ldots, \beta_p} \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right) r(X_i) \left( Y_i - \beta_0 - \beta_1 (X_i - x) - \ldots - \beta_p (X_i - x)^p \right)^2 .
\]
To obtain a nonparametric estimator for the function $m(\cdot)$, this minimization is repeated for every $x$ at which a fit is needed. The solution to the minimization can be written down explicitly as
\[
\hat m(x) = s_x^T \boldsymbol{Y} = e_1^T \left( \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} \boldsymbol{X}_x \right)^{-1} \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} \boldsymbol{Y}, \tag{3}
\]
with $e_i$ a vector with a one in the $i$th position and zeros elsewhere, the matrix $K_x = \mathrm{diag}\left\{ \tfrac{1}{h} K\!\left(\tfrac{X_1 - x}{h}\right), \ldots, \tfrac{1}{h} K\!\left(\tfrac{X_n - x}{h}\right) \right\}$, $R = \mathrm{diag}\{ r(X_1), \ldots, r(X_n) \}$, and
\[
\boldsymbol{X}_x = \begin{pmatrix} 1 & (X_1 - x) & \cdots & (X_1 - x)^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & (X_n - x) & \cdots & (X_n - x)^p \end{pmatrix}
\]
(see also Ruppert and Wand, 1994). Since $R^{1/2}$ and $K_x$ are diagonal matrices, $R^{1/2} K_x R^{1/2} = R K_x = K_x R$. We prefer the first notation because it generalizes readily to non-diagonal weight matrices and simplifies the matrix algebra in the proofs.
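To make the estimator concrete, here is a minimal computational sketch of the weighted local linear fit ($p = 1$) in equation (3). It is illustrative only, not the gam() implementation; the Epanechnikov kernel and the function names are our own choices.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel (any bounded, compactly supported kernel would do)."""
    return 0.75 * np.maximum(1.0 - u**2, 0.0)

def weighted_loclin(x0, X, Y, h, r):
    """Weighted local linear estimate m_hat(x0), as in equation (3) with p = 1.

    X, Y : 1-d arrays of covariate and response values.
    h    : bandwidth.
    r    : 1-d array of observation weights r(X_i).
    """
    Xx = np.column_stack([np.ones_like(X, dtype=float), X - x0])  # design matrix X_x
    k = epanechnikov((X - x0) / h) / h                            # kernel weights, diag(K_x)
    w = k * r                                                     # combined weights R^{1/2} K_x R^{1/2}
    XtW = Xx.T * w                                                # X_x^T R^{1/2} K_x R^{1/2}
    beta = np.linalg.solve(XtW @ Xx, XtW @ Y)                     # weighted least squares at x0
    return beta[0]                                                # beta_0 = m_hat(x0)

# Example use: evaluate the fit on a grid, with r = 1/v(X_i) if v were known.
# x_grid = np.linspace(X.min(), X.max(), 100)
# m_hat = np.array([weighted_loclin(x0, X, Y, h=0.2, r=r_weights) for x0 in x_grid])
```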

We introduce some additional notation and assumptions before stating the asymptotic bias and variance results for $\hat m(x)$. Let $f(x)$ represent the density of $X_i$. For the kernel function $K$, write the moments of $K$ as $\mu_j(K) = \int u^j K(u)\,du$ for any $j$ and let $R(K) = \int K(u)^2\,du$. We make the following technical assumptions:

(AS.I) The kernel $K$ is bounded and continuous and has compact support. Also, $\mu_{p+1}(K) \neq 0$.

(AS.II) The density $f$ is bounded, continuous and differentiable, has compact support and $f(x) > 0$ for all $x \in [a_x, b_x] = \mathrm{supp}(f)$.

(AS.III) The weight function $r$ is bounded, continuous and differentiable, and $r(x) > 0$ for all $x \in [a_x, b_x]$.

(AS.IV) The mean function $m$ is continuous and differentiable up to order $p + 1$ over $(a_x, b_x)$.

(AS.V) The variance function $v$ is continuous and $v(x) > 0$ for all $x \in [a_x, b_x]$.

(AS.VI) As $n \to \infty$, $h \to 0$ and $nh \to \infty$.

The following result is proven in the Appendix.

Theorem 3.1 For local polynomial fitting of degree $p$, for $p > 0$ and odd, the conditional bias and variance of $\hat m(x)$ for $x \in (a_x, b_x)$ can be approximated by
\[
E(\hat m(x) - m(x) \mid \boldsymbol{X}) = \frac{1}{(p+1)!}\, h^{p+1} \mu_{p+1}(K)\, m^{(p+1)}(x) + o_p(h^{p+1})
\]
and
\[
\mathrm{Var}(\hat m(x) \mid \boldsymbol{X}) = \frac{1}{nh}\, R(K)\, v(x)\, f(x)^{-1} \left(1 + o_p(1)\right).
\]


As discussed in Remark 1 of Ruppert and Wand (1994), the leading terms of the bias and variance do not depend on $\boldsymbol{X}$. Nevertheless, this is still a conditional

    result, because in general the unconditional asymptotic bias and variance are not

    guaranteed to exist.

    Theorem 3.1 implies that, asymptotically, the inclusion of observation weights

    that are a smooth function of the covariate has no effect on the bias and variance

    of the local polynomial regression estimator. As we will show below, however, that

    is no longer true when local polynomial regression is used within an additive model

    context.

    3.2 Additive Models Using Weighted Smoothers

We now consider data generated by the additive model
\[
Y_i = \alpha + m_1(X_{1i}) + \ldots + m_D(X_{Di}) + v(X_{1i}, \ldots, X_{Di})^{1/2} \varepsilon_i,
\]
where $v(\cdot)$ is a continuous, bounded function. We are interested in the heteroskedastic case since this is typically the situation encountered in the local scoring algorithm. Backfitting estimators for the additive model are usually defined as the solution of the

    backfitting algorithm at convergence. The backfitting algorithm is shown in step

    2(b) in Figure 1.

An equivalent definition for these estimators is to view them as the solutions to the following set of estimating equations:
\[
\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & \vdots & \ddots & \vdots \\ S_D & S_D & \cdots & I \end{pmatrix}
\begin{pmatrix} \hat{\boldsymbol{m}}_1 \\ \hat{\boldsymbol{m}}_2 \\ \vdots \\ \hat{\boldsymbol{m}}_D \end{pmatrix}
=
\begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_D \end{pmatrix} \boldsymbol{Y}, \tag{4}
\]
where $S_1, \ldots, S_D$ are $n \times n$ smoother matrices for the $D$ covariates. In the case of local polynomial regression, if $s_{d,x}$ represents the $n \times 1$ smoother vector that maps the vector of observations $\boldsymbol{Y}$ to its nonparametric mean function estimate at a point $x$ for the $d$th covariate (as equation (3) did in the univariate case), then
\[
S_d = \begin{pmatrix} s_{d,X_{d1}}^T \\ \vdots \\ s_{d,X_{dn}}^T \end{pmatrix}.
\]


    Equation (3) provides the explicit expression for the smoother vectors for weighted

    local polynomial regression.

    Expression (4) represents a system of nD equations in nD unknowns and is

    solved through backfitting, but it is also possible, at least conceptually, to write the

    estimators directly as

\[
\begin{pmatrix} \hat{\boldsymbol{m}}_1 \\ \hat{\boldsymbol{m}}_2 \\ \vdots \\ \hat{\boldsymbol{m}}_D \end{pmatrix}
=
\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & \vdots & \ddots & \vdots \\ S_D & S_D & \cdots & I \end{pmatrix}^{-1}
\begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_D \end{pmatrix} \boldsymbol{Y}
\equiv M^{-1} C \boldsymbol{Y},
\]

    with the matrices M and C defined in this equation.

For most smoothing methods in practice, including local polynomial regression, the matrix $M$ as written here is not invertible. The smoothing matrices $S_d$ need to be replaced by the centered smoothing matrices $S_d^* = (I - \boldsymbol{1}\boldsymbol{1}^T/n) S_d$ for all $d = 1, \ldots, D$, where $\boldsymbol{1}$ is an $n \times 1$ vector of ones. The invertibility of the matrix $M$ composed of centered smoothers is discussed in Buja et al. (1989) and, for the case of unweighted local polynomial fitting, in Opsomer and Ruppert (1997) and Opsomer (2000).

For simplicity, we will focus here on the case in which $D = 2$ and the local polynomials are of odd degree. The main results generalize to $D > 2$ using the recursive approach of Opsomer (2000), but the expressions become much more complicated. Buja et al. (1989) give explicit expressions for the $\hat{\boldsymbol{m}}_d$ when $D = 2$:
\[
\begin{aligned}
\hat{\boldsymbol{m}}_1 &= \left\{ I - (I - S_1^* S_2^*)^{-1} (I - S_1^*) \right\} \boldsymbol{Y} \equiv W_1 \boldsymbol{Y} \\
\hat{\boldsymbol{m}}_2 &= \left\{ I - (I - S_2^* S_1^*)^{-1} (I - S_2^*) \right\} \boldsymbol{Y} \equiv W_2 \boldsymbol{Y},
\end{aligned} \tag{5}
\]

provided the inverses exist. Since these direct expressions correspond to the solutions of the backfitting algorithm at convergence, it is possible to derive many of

    the properties of backfitting estimators from them. In particular, the convergence of

the algorithm in step 2(b) of Figure 1 and the uniqueness of the estimators (5) both follow

    directly from the existence of the inverse of M.
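As a computational illustration of (4), (5) and the centering step, the following sketch builds weighted local linear smoother matrices and runs the backfitting iteration of step 2(b) for $D = 2$. The names and kernel choice are ours; this is a sketch of the technique, not the gam() implementation.

```python
import numpy as np

def loclin_smoother_matrix(X, h, r):
    """n x n weighted local linear smoother matrix S: row i is the smoother
    vector s_{X_i}^T from equation (3), so (S @ Y)[i] = m_hat(X_i)."""
    n = len(X)
    S = np.zeros((n, n))
    for i in range(n):
        Xx = np.column_stack([np.ones(n), X - X[i]])
        k = 0.75 * np.maximum(1.0 - ((X - X[i]) / h) ** 2, 0.0) / h  # Epanechnikov kernel
        XtW = Xx.T * (k * r)
        S[i] = np.linalg.solve(XtW @ Xx, XtW)[0]   # e_1^T (X'WX)^{-1} X'W
    return S

def backfit_two(Y, S1, S2, tol=1e-8, max_iter=500):
    """Backfitting for D = 2 using centered smoothers S* = (I - 11'/n) S."""
    n = len(Y)
    center = np.eye(n) - np.ones((n, n)) / n
    S1c, S2c = center @ S1, center @ S2
    alpha = Y.mean()
    m1 = np.zeros(n)
    m2 = np.zeros(n)
    for _ in range(max_iter):
        m1_new = S1c @ (Y - alpha - m2)            # update component 1
        m2_new = S2c @ (Y - alpha - m1_new)        # update component 2 with latest m1
        change = max(np.max(np.abs(m1_new - m1)), np.max(np.abs(m2_new - m2)))
        m1, m2 = m1_new, m2_new
        if change < tol:
            break
    return alpha, m1, m2
```

In this sketch the centering is applied directly to the smoother matrices, so the component functions are identifiable and the iteration corresponds to solving (4) by Gauss-Seidel sweeps.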

The results from Opsomer and Ruppert (1997) will be generalized to the situation in which a set of observation weights $r(\boldsymbol{X}_i)$, $i = 1, \ldots, n$ are used in the local polynomial regression. Let $r_1(x_1) = E(r(\boldsymbol{X}_i) \mid X_{1i} = x_1)$ and $r_2(x_2) = E(r(\boldsymbol{X}_i) \mid X_{2i} = x_2)$ denote the conditional univariate weight functions. The following assumptions are made:


(AS.I) The kernel $K$ is bounded and continuous and has compact support. Also, $\mu_{p_1+1}(K), \mu_{p_2+1}(K) \neq 0$.

(AS.II) The design densities $f$, $f_1$ and $f_2$ are bounded, continuous and differentiable, have compact support and $f(x_1, x_2) > 0$ for all $(x_1, x_2) \in \mathrm{supp}(f)$. The first derivatives of $f_1$ and $f_2$ have a finite number of sign changes over their support.

(AS.III) The weight function $r$ is bounded, continuous and differentiable, and $r(x_1, x_2) > 0$ for all $(x_1, x_2) \in \mathrm{supp}(f)$. For any fixed $x_1$, $\partial r(x_1, x_2)/\partial x_2$ has a finite number of sign changes over $\mathrm{supp}(f)$, and similarly with both variables interchanged. Also, the first derivatives of $r_1$ and $r_2$ have a finite number of sign changes over $\mathrm{supp}(f)$.

(AS.IV) The additive functions $m_1$, $m_2$ are continuous and differentiable up to order $p_1 + 1$, $p_2 + 1$, respectively, over $\mathrm{supp}(f)$.

(AS.V) The variance function $v$ is continuous and $v(x_1, x_2) > 0$ for all $(x_1, x_2) \in \mathrm{supp}(f)$.

(AS.VI) As $n \to \infty$, $h_1, h_2 \to 0$ and $nh_1/\log(n), nh_2/\log(n) \to \infty$.

Assumption (AS.III) on the conditional weight functions is rather technical in nature, but should be satisfied by any reasonable weight function used in the local scoring context. The remaining assumptions are the same as in Opsomer and Ruppert (1997). The following result generalizes Lemmas 3.1–3.2 of Opsomer and

    Ruppert (1997) and is proven in the Appendix.

Lemma 3.1 Under Assumptions (AS.I)-(AS.III), the following asymptotic approximations hold uniformly over all elements of the matrices:
\[
S_1^* = S_1 - \boldsymbol{1}\boldsymbol{1}^T/n + o(\boldsymbol{1}\boldsymbol{1}^T/n) \quad \text{a.s.}
\]
\[
S_1^* S_2^* = T_{12} + o(\boldsymbol{1}\boldsymbol{1}^T/n) \quad \text{a.s.}
\]
where $T_{12}$ is a matrix whose $ij$th element is
\[
[T_{12}]_{ij} = \frac{1}{n}\, \frac{f(X_{1i}, X_{2j})}{f_1(X_{1i})\, f_2(X_{2j})}\, \frac{r(X_{1i}, X_{2j})}{r_1(X_{1i})\, r_2(X_{2j})}\, r(X_{1j}, X_{2j}) - \frac{1}{n}.
\]


    The approximation for T12 simplifies to that given in Opsomer and Ruppert

    (1997), Lemma 3.1 under equal weighting. In Lemma 3.2, Opsomer and Ruppert

(1997) provide sufficient conditions on the joint distribution of $X_{1i}$ and $X_{2i}$ to ensure that the additive model estimator is asymptotically unique. Because asymptotic uniqueness of the additive model estimators depends on the invertibility of $(I - T_{12})$, it follows directly from Lemma 3.1 that in the weighted case, both the distribution of

    the Xi and the weight function will have an effect on the existence of the estimators

    through the spectral radius of T12 (see Remark 3.1 in Opsomer and Ruppert (1997)

    for details). Developing sufficient conditions guaranteeing asymptotic uniqueness in

    the weighted case would therefore be very cumbersome and not very useful, since in

    practice they cannot be checked. We will therefore make an additional assumption

guaranteeing invertibility:

(AS.VII) There exists a matrix norm $\|\cdot\|$ such that $\|T_{12}\| < 1$.

This assumption and the uniform convergence results in Lemma 3.1 are sufficient to

    prove that Lemma 3.2 in Opsomer and Ruppert (1997) holds for weighted additive

    models. In particular, this guarantees that the estimators exist for sufficiently large

    n and that backfitting converges to a unique solution.
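In practice, a sample analogue of this uniqueness condition can be checked directly from the centered smoother matrices. A minimal sketch (our own; S1c and S2c are assumed to be the centered matrices built as in the backfitting sketch above):

```python
import numpy as np

def backfitting_unique(S1c, S2c):
    """Check the sample analogue of (AS.VII): for D = 2, the backfitting iteration
    has a unique solution when the spectral radius of S1c @ S2c (which T12
    approximates) is strictly below 1."""
    return np.max(np.abs(np.linalg.eigvals(S1c @ S2c))) < 1.0
```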

Additional notation is needed before stating the next result. Let $D^p$ represent the $p$th derivative operator, and let
\[
D^p \boldsymbol{m} = \left( \frac{d^p m(X_1)}{dx^p}, \ldots, \frac{d^p m(X_n)}{dx^p} \right)^T
\]
for any function $m(\cdot)$. The main result of this section is stated in the following theorem, proven in the Appendix.

Theorem 3.2 Suppose that assumptions (AS.I)-(AS.VII) hold and that the local polynomials are of odd degree $p_1, p_2$. At the observation points $(X_{1i}, X_{2i})$, $i = 1, \ldots, n$, the conditional bias and variance of $\hat m_1(X_{1i})$ can be approximated by
\[
\begin{aligned}
E(\hat m_1(X_{1i}) - m_1(X_{1i}) \mid \boldsymbol{X}_1, \boldsymbol{X}_2)
 = {}& \frac{1}{(p_1+1)!}\, h_1^{p_1+1} \mu_{p_1+1}(K)\, e_i^T (I - T_{12})^{-1} \left\{ D^{p_1+1}\boldsymbol{m}_1 - E\!\left(m_1^{(p_1+1)}(X_1)\right) \boldsymbol{1} \right\} \\
 & - \frac{1}{(p_2+1)!}\, h_2^{p_2+1} \mu_{p_2+1}(K)\, e_i^T (I - T_{12})^{-1} \left\{ M_2 - E\!\left(m_2^{(p_2+1)}(X_2)\right) \boldsymbol{1} \right\} \\
 & + O_p\!\left(\frac{1}{\sqrt{n}}\right) + o_p\!\left(h_1^{p_1+1} + h_2^{p_2+1}\right),
\end{aligned}
\]
with
\[
M_2 = \begin{pmatrix}
E\!\left(m_2^{(p_2+1)}(X_2)\, r(X_1, X_2) \mid X_1 = X_{11}\right) / r_1(X_{11}) \\
\vdots \\
E\!\left(m_2^{(p_2+1)}(X_2)\, r(X_1, X_2) \mid X_1 = X_{1n}\right) / r_1(X_{1n})
\end{pmatrix},
\]
and
\[
\mathrm{Var}(\hat m_1(X_{1i}) \mid \boldsymbol{X}_1, \boldsymbol{X}_2) = \frac{1}{nh_1}\, \frac{R(K)}{f_1(X_{1i})}\, \frac{E\!\left(v(X_1, X_2)\, r(X_1, X_2)^2 \mid X_1 = X_{1i}\right)}{r_1(X_{1i})^2} + o_p\!\left(\frac{1}{nh_1}\right).
\]

    As in Opsomer and Ruppert (1997), it is also possible to derive the properties of

the estimator for the additive mean function $E(Y_i \mid X_{1i}, X_{2i}) = \alpha + m_1(X_{1i}) + m_2(X_{2i})$. The bias of that estimator is the sum of the bias terms for $\hat m_1(X_{1i})$ and $\hat m_2(X_{2i})$, after removing the $O_p(1/\sqrt{n})$ terms, and its variance is the sum of their variances

    (see Opsomer and Ruppert (1997), Theorem 4.1 for details).

    The theorem shows that the weighted additive model estimator is consistent and

has a bias of order $O_p(h_1^{p_1+1} + h_2^{p_2+1})$, so that it has the same rates of convergence

    as one-dimensional smoothing. However, the bias expression in Theorem 3.2 is quite

    complicated in the weighted case, and it is not clear whether the bias is smaller or

    larger than in the unweighted case.

    The use of the weight function r also results in a variance of the same order as

    the unweighted case, with a more complicated leading term. If the errors have a

constant variance $v \equiv \sigma^2$, then it is easy to see that the use of non-equal weights will increase the asymptotic variance, since
\[
\frac{E\!\left(r(X_1, X_2)^2 \mid X_1 = X_{1i}\right)}{r_1(X_{1i})^2} \geq 1
\]
for any weight function $r$, by Jensen's inequality applied to $r_1(X_{1i}) = E(r(X_1, X_2) \mid X_1 = X_{1i})$. This is in marked contrast to the result in Theorem 3.1

    for univariate weighted local polynomial regression, where the effect of weighting is

    asymptotically negligible. If the variance function is known, however, then weighting

    might still be a good idea, as the following corollary makes more precise.

Corollary 3.1 Suppose that $r(X_1, X_2) = v(X_1, X_2)^{-1}$ for all $(X_1, X_2) \in \mathrm{supp}(f)$. Then, the variance of the additive model estimator is approximated by
\[
\mathrm{Var}(\hat m_1(X_{1i}) \mid \boldsymbol{X}_1, \boldsymbol{X}_2) = \frac{1}{nh_1}\, \frac{R(K)}{f_1(X_{1i})\, E\!\left(v(X_1, X_2)^{-1} \mid X_1 = X_{1i}\right)} + o_p\!\left(\frac{1}{nh_1}\right).
\]


This corollary shows that, if it is possible to weight the observations by the inverse of the true variance function, then the effect of weighting on the asymptotic variance is

    negligible. Note that the asymptotic bias in Theorem 3.2 remains affected by the

    weights even in that case.

    4 Local Scoring Estimators

    We apply the results from Section 3 to the local scoring estimators for generalized

    additive models. At any iteration t of the local scoring algorithm in Figure 1, the

    weighted additive model provides consistent estimators for the additive model

\[
E(z_i) = \alpha + m_1(X_{1i}) + m_2(X_{2i})
\]
with weights
\[
r(X_{1i}, X_{2i}) = g'(\mu_i)^{-2}\, V(\mu_i)^{-1},
\]
where $\mu_i = g^{-1}(\alpha + m_1(X_{1i}) + m_2(X_{2i}))$ as calculated in the previous iteration $t - 1$.

In general, it will not be possible to check that all the conditions on $r$ stated in (AS.III) hold for the estimated weights $\hat r$ used at each iteration. If $g$ and $V$ are known functions of $\mu = E(Y \mid X_1, X_2)$, it is possible to check whether the conditions hold for $r(X_1, X_2) = g'(\mu)^{-2} V(\mu)^{-1}$. For the right choice of bandwidth and reasonably well-behaved data, however, it is reasonable to assume that $\hat r(X_1, X_2)$ will then also be a continuous, differentiable function over $\mathrm{supp}(f)$ with the technical smoothness properties mentioned in (AS.III).
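As an illustration of these weights (our own sketch, for the binomial/logit special case rather than a general link): with $g(\mu) = \log\{\mu/(1-\mu)\}$ we have $g'(\mu) = 1/\{\mu(1-\mu)\}$ and $V(\mu) = \mu(1-\mu)$, so $r = g'(\mu)^{-2} V(\mu)^{-1} = \mu(1-\mu)$, the familiar binomial working weights.

```python
import numpy as np

def logit_working_weights(eta):
    """r = g'(mu)^(-2) V(mu)^(-1) for the logit link, which reduces to mu * (1 - mu)."""
    mu = 1.0 / (1.0 + np.exp(-np.asarray(eta)))   # mu = g^{-1}(eta)
    return mu * (1.0 - mu)
```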

    The local scoring algorithm of Hastie and Tibshirani (1990) is difficult to analyze

for general linear smoothers. For generalized additive models fitted using local polynomial regression, local scoring does not explicitly solve a set of score equations as in

the generalized linear model case (e.g. McCullagh and Nelder (1989), p. 41). Therefore, there is no explicit expression for what the Newton-Raphson steps converge to.

This was one of the main reasons for proposing local likelihood estimation for

    GAM in Kauermann and Opsomer (2002) as an easier-to-study alternative to local

scoring. Nevertheless, we can look at two simple cases to gain some insight into the

    behavior of the local scoring algorithm.

First, it is easy to show that if the model has an identity link ($\mu_i = \eta_i$), the

    local scoring algorithm in Figure 1, using the starting values proposed by Hastie and

    Tibshirani (1990), is equivalent to a weighted additive model. Hence, in this case

    only one iteration of the outer loop in Figure 1 is performed and the asymptotic

    properties of the estimators are given in Theorem 3.2.


    Second, consider the hypothetical one-step estimate where the true values of

    the unknown quantities are used as starting values. This corresponds to the approach

used when an iterative algorithm solves explicit equations, since in that case the one-step estimate using the true values is asymptotically equivalent to the fully iterated

    solution (see Serfling, 1980, p.258). This will be the case if we can assume that

    the local scoring estimator is consistent, for instance. Aerts et al. (2002) use this

    approach for generalized additive model estimation with penalized regression splines.

At the true values, we have $\mu_i = g^{-1}(\alpha + m_1(X_{1i}) + m_2(X_{2i}))$ and
\[
z_i = \alpha + m_1(X_{1i}) + m_2(X_{2i}) + (Y_i - \mu_i)\, g'(\mu_i).
\]
Hence, $E(z_i) = \alpha + m_1(X_{1i}) + m_2(X_{2i})$ and $\mathrm{Var}(z_i) = g'(\mu_i)^2\, V(\mu_i) = r(X_{1i}, X_{2i})^{-1}$, so that Theorem 3.2 and Corollary 3.1 again apply directly to this case. If the weight function is not correctly specified, only Theorem 3.2 applies.

This implies that if we have starting values for $\mu_i$ that are close to the true values, the Hastie and Tibshirani (1990) local scoring algorithm should provide estimators with desirable statistical properties similar to those of additive models,

    timators with desirable statistical properties similar to those of additive models,

    including asymptotic unbiasedness and one-dimensional nonparametric regression

    convergence rates. These results generalize to the D-dimensional case, as done for

    additive models in Opsomer (2000).

    5 Conclusions

In this article, we have described the asymptotic properties of additive models and

    generalized additive models in the presence of observation weights. We have shown

that, unlike in univariate nonparametric regression, observation weights can potentially inflate the variance and modify the bias in the additive model. The effect on

    the asymptotic variance can be avoided, but only if the weights correspond to the

    inverse of the variance of the model errors. Hence, if the weights are not variance-

    related, for instance when the weighting comes from sampling design considerations,

the resulting estimator will have a larger variance than the unweighted estimator

    (this is analogous to what happens with weighted least squares (WLS) estimators

    in parametric linear regression). Overall, while the weights indeed affect the leading

    terms of both the asymptotic bias and variance, they do not change the convergence

    rates of the estimators.


    We have discussed some of the implications of these findings for the widely used

    local scoring estimators. In particular, if the model (and its variance) is well-specified

and the local scoring estimators are consistent, the effect of the weights is asymptotically negligible. Because of the iterative nature of the estimator, it is difficult to

    prove this rigorously, however.

One important implication of this article is that the effect of the weights cannot be

    ignored in additive and generalized additive models. In addition to exploring some of

    the consequences of the weights in this article, the results proven here will be helpful

    for researchers working on these models. For instance, Kauermann and Opsomer

    (2002) have used the results on weighted additive models in deriving the asymptotic

    properties of local likelihood estimators for generalized additive models.

    A Proofs

    Proof of Theorem 3.1: To simplify notation, we will prove the theorem for the case

    p = 1. The method of proof is entirely analogous to that of Ruppert and Wand

    (1994), Theorem 2.1, and it can be generalized to arbitrary p following the approach

    in their Theorem 4.1.

First, note that the bias can be written as
\[
E(\hat m(x) - m(x) \mid \boldsymbol{X}) = \frac{1}{2}\, e_1^T \left( \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} \boldsymbol{X}_x \right)^{-1} \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} \left( Q_m(x) + B_m(x) \right),
\]
with $Q_m(x) = ((X_1 - x)^2, \ldots, (X_n - x)^2)^T m''(x)$ and $B_m(x)$ a vector of Taylor series remainder terms. The latter is of smaller order than the terms in $Q_m(x)$, provided this is non-zero. Now,
\[
\left( \frac{1}{n} \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} \boldsymbol{X}_x \right)^{-1} =
\begin{pmatrix}
(r(x) f(x))^{-1} + o_p(1) & -D(rf)(x)\,(r(x) f(x))^{-2} + o_p(1) \\
-D(rf)(x)\,(r(x) f(x))^{-2} + o_p(1) & \left( \mu_2(K)\, r(x) f(x)\, h^2 \right)^{-1} + o_p(h^{-2})
\end{pmatrix}
\]
and
\[
\frac{1}{n} \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} Q_m(x) =
\begin{pmatrix}
h^2 \mu_2(K)\, r(x) f(x) + o_p(h^2) \\
h^4 \mu_4(K)\, D(rf)(x) + o_p(h^4)
\end{pmatrix} m''(x),
\]
so that
\[
E(\hat m(x) - m(x) \mid \boldsymbol{X}) = \frac{1}{2}\, h^2 \mu_2(K)\, m''(x) + o_p(h^2).
\]

For the variance, we need to approximate
\[
\frac{1}{n} \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} V R^{1/2} K_x R^{1/2} \boldsymbol{X}_x =
\frac{1}{h}\, r(x)^2 f(x) v(x)
\begin{pmatrix}
R(K) & h \mu_1(K^2) \\
h \mu_1(K^2) & h^2 \mu_2(K^2)
\end{pmatrix}
\left( 1 + o_p(1) \right),
\]
so that
\[
\mathrm{Var}(\hat m(x) \mid \boldsymbol{X}) = \frac{1}{nh}\, R(K)\, v(x)\, f(x)^{-1} \left( 1 + o_p(1) \right).
\]

    Proof of Lemma 3.1: The proof will follow the same approach as that in Opsomer

    and Ruppert (1997) Lemma 3.1. We begin by showing that the statements in the

    lemma hold in probability, using the approximation

\[
[S_1]_{ij} = \frac{1}{nh_1}\, r_1(X_{1i})^{-1} f_1(X_{1i})^{-1}\, K\!\left(\frac{X_{1j} - X_{1i}}{h_1}\right) r(X_{1j}, X_{2j}) \left(1 + o_p(1)\right).
\]

    For the first statement, the reasoning is completely analogous to that in Opsomer

    and Ruppert (1997) Lemma 3.1. For the second statement,

\[
\begin{aligned}
[S_1 S_2]_{ij} &= \frac{1}{n^2 h_1 h_2}\, \frac{r(X_{1j}, X_{2j})}{r_1(X_{1i})\, f_1(X_{1i})} \left(1 + o_p(1)\right)
\sum_{k=1}^{n} \frac{r(X_{1k}, X_{2k})}{r_2(X_{2k})\, f_2(X_{2k})}\, K\!\left(\frac{X_{1k} - X_{1i}}{h_1}\right) K\!\left(\frac{X_{2k} - X_{2j}}{h_2}\right) \\
&= \frac{1}{n}\, \frac{f(X_{1i}, X_{2j})}{f_1(X_{1i})\, f_2(X_{2j})}\, \frac{r(X_{1i}, X_{2j})}{r_1(X_{1i})\, r_2(X_{2j})}\, r(X_{1j}, X_{2j}) \left(1 + o_p(1)\right).
\end{aligned}
\]
The second statement in the lemma then holds in probability, from this approximation and the first statement.

    Because of the additional assumptions on the weight function in (AS.III), the

    same approach as in the proof of Opsomer and Ruppert (1997) Lemma 3.1 can be

    followed to prove the uniform convergence.

Proof of Theorem 3.2: The proof will follow the approach used in proving Theorem 4.1 in Opsomer and Ruppert (1997), and for simplicity, we consider only the case $p_1 = p_2 = 1$. We let $Q_1 = (s_{1,X_{11}}^T Q_{m_1}(X_{11}), \ldots, s_{1,X_{1n}}^T Q_{m_1}(X_{1n}))^T$ and $Q_1^* = (I - \boldsymbol{1}\boldsymbol{1}^T/n) Q_1$, with analogous definitions holding for $Q_2$ and $Q_2^*$. It follows directly from Theorem 3.1 and equation (5) that
\[
E(\hat{\boldsymbol{m}}_1 - \boldsymbol{m}_1) = \frac{1}{2} (I - S_1^* S_2^*)^{-1} \left( Q_1^* - S_1^* Q_2^* \right) + O_p\!\left(\frac{1}{\sqrt{n}}\right) + o_p\!\left(h_1^2 + h_2^2\right)
\]
with the $O_p(1/\sqrt{n})$ term due to the presence of the model intercept. $Q_1^*$ can be approximated as in the proof of Theorem 3.1 and shown to be asymptotically unaffected by the presence of the weights. Unlike in the unweighted case, however, a similar calculation for $S_1^* Q_2^*$ will involve terms of the form
\[
s_{1,x_1}^T D^2 \boldsymbol{m}_2 = \frac{E\!\left( m_2''(X_2)\, r(X_1, X_2) \mid X_1 = x_1 \right)}{r_1(x_1)} + o_p(1),
\]
so that
\[
S_1^* Q_2^* = \mu_2(K)\, h_2^2\, M_2^* + o_p(h_2^2),
\]
where $M_2^* = (I - \boldsymbol{1}\boldsymbol{1}^T/n) M_2$. Hence, the conditional bias of $\hat{\boldsymbol{m}}_1$ can be approximated by
\[
\begin{aligned}
E(\hat{\boldsymbol{m}}_1 - \boldsymbol{m}_1 \mid \boldsymbol{X}_1, \boldsymbol{X}_2)
 = {}& \frac{1}{2}\, h_1^2 \mu_2(K)\, (I - T_{12})^{-1} \left\{ D^2 \boldsymbol{m}_1 - E\!\left(m_1''(X_1)\right) \boldsymbol{1} \right\} \\
 & - \frac{1}{2}\, h_2^2 \mu_2(K)\, (I - T_{12})^{-1} \left\{ M_2 - E\!\left(m_2''(X_2)\right) \boldsymbol{1} \right\} \\
 & + O_p\!\left(\frac{1}{\sqrt{n}}\right) + o_p\!\left(h_1^2 + h_2^2\right).
\end{aligned}
\]

For the variance approximation, we start from the exact variance
\[
\mathrm{Var}(\hat m_1(X_{1i}) \mid \boldsymbol{X}_1, \boldsymbol{X}_2) = e_i^T W_1 V W_1^T e_i
\]
with $V = \mathrm{diag}\{ v(X_{11}, X_{21}), \ldots, v(X_{1n}, X_{2n}) \}$. Following the approach in the proof of Theorem 4.1 in Opsomer and Ruppert (1997), the leading variance term can be shown to be $e_i^T S_1^* V S_1^{*T} e_i$. Using the approach from the proof of Theorem 3.1 again to approximate this term, we find
\[
\mathrm{Var}(\hat m_1(X_{1i}) \mid \boldsymbol{X}_1, \boldsymbol{X}_2) = \frac{1}{nh_1}\, \frac{R(K)}{f_1(X_{1i})}\, \frac{E\!\left( v(X_1, X_2)\, r(X_1, X_2)^2 \mid X_1 = X_{1i} \right)}{r_1(X_{1i})^2} + o_p\!\left(\frac{1}{nh_1}\right).
\]

    References

Aerts, M., G. Claeskens, and M. Wand (2002). Some theory for penalized spline generalized additive models. Journal of Statistical Planning and Inference 103, 455–470.

Ansley, C. F. and R. Kohn (1994). Convergence of the backfitting algorithm for additive models. Journal of the Australian Mathematical Society (Series A) 57, 316–329.

Bio, A., R. Alkemade, and A. Barendregt (1998). Determining alternative models for vegetation response analysis: a non-parametric approach. Journal of Vegetation Science 9, 5–16.

Buja, A., T. J. Hastie, and R. J. Tibshirani (1989). Linear smoothers and additive models. Annals of Statistics 17, 453–555.

Burman, P. (1990). Estimation of generalized additive models. Journal of Multivariate Analysis 32, 230–255.

Couper, D. and M. S. Pepe (1997). Modelling prevalence of a condition: Chronic graft-versus-host disease after bone marrow transplantation. Statistics in Medicine 16, 1551–1571.

Figueiras, A. and C. Cadarso-Suárez (2001). Application of nonparametric models for calculating odds ratios and their confidence intervals for continuous exposures. American Journal of Epidemiology 154(3), 264–275.

Fricker, R. D., Jr. and N. W. Hengartner (2001). Environmental equity and the distribution of toxic release inventory and other environmentally undesirable sites in metropolitan New York City. Environmental and Ecological Statistics 8(1), 33–52.

Friedman, J. H. and W. Stuetzle (1981). Projection pursuit regression. Journal of the American Statistical Association 76, 817–823.

Gu, C., D. M. Bates, Z. Chen, and G. Wahba (1989). The computation of GCV functions through Householder tridiagonalization with application to the fitting of interaction spline models. SIAM Journal on Matrix Analysis and Applications 10, 457–480.

Hardle, W. and P. Hall (1993). On the backfitting algorithm for additive regression models. Statistica Neerlandica 47, 43–57.

Hastie, T. J. and R. J. Tibshirani (1990). Generalized Additive Models. Washington, D.C.: Chapman and Hall.

Kauermann, G. and J. D. Opsomer (2002). Local likelihood estimation in generalized additive models. To appear in Scandinavian Journal of Statistics.

Linton, O. B. (2000). Efficient estimation of generalized additive nonparametric regression models. Econometric Theory 16, 502–523.

Mammen, E., O. Linton, and J. Nielsen (1999). The existence and asymptotic properties of a backfitting projection algorithm under weak conditions. Annals of Statistics 27, 1443–1490.

McCullagh, P. and J. A. Nelder (1989). Generalized Linear Models (2nd ed.). London: Chapman and Hall.

Nelder, J. A. and R. W. M. Wedderburn (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A 135, 370–384.

Opsomer, J. D. (2000). Asymptotic properties of backfitting estimators. Journal of Multivariate Analysis 73, 166–179.

Opsomer, J. D. and D. Ruppert (1997). Fitting a bivariate additive model by local polynomial regression. Annals of Statistics 25, 186–211.

Rothery, P. and D. B. Roy (2001). Application of generalized additive models to butterfly transect count data. Journal of Applied Statistics 28(7), 897–909.

Ruppert, D. and M. P. Wand (1994). Multivariate locally weighted least squares regression. Annals of Statistics 22, 1346–1370.

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons.

Sperlich, S., O. Linton, and W. Hardle (1999). Integration and backfitting methods in additive models: finite sample properties and comparison. Test 8, 419–459.

Stone, C. J. (1985). Additive regression and other nonparametric models. Annals of Statistics 13, 689–705.

Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models. Annals of Statistics 14, 590–606.

Wahba, G. (1986). Partial and interaction splines for the semiparametric estimation of functions of several variables. In T. J. Boardman (Ed.), Computer Science and Statistics: Proceedings of the 18th Symposium on the Interface, pp. 75–80. American Statistical Association.


1. Initialize: $t = 0$, $\alpha = g(\bar y)$, $\boldsymbol{m}_1^t = \ldots = \boldsymbol{m}_D^t = \boldsymbol{0}$, with $\boldsymbol{m}_d^t = (m_d^t(X_{d1}), \ldots, m_d^t(X_{dn}))^T$.

2. Update:

   (a) Transformation/reweighting: construct an adjusted dependent variable
   \[
   z_i = \eta_i^t + (y_i - \mu_i^t) \left( \frac{\partial \eta_i}{\partial \mu_i} \right)_t, \quad i = 1, \ldots, n
   \]
   with $\eta_i^t = \alpha + \sum_{d=1}^{D} m_d^t(X_{di})$ and $\mu_i^t = g^{-1}(\eta_i^t)$, and the weights
   \[
   w_i = \left( \frac{\partial \mu_i}{\partial \eta_i} \right)_t^2 (V_i^t)^{-1}, \quad i = 1, \ldots, n.
   \]

   (b) Backfitting: fit the weighted additive model to $\boldsymbol{z} = (z_1, \ldots, z_n)^T$ to obtain estimated functions $m_d^{t+1}(\cdot)$ through backfitting:

      i. Initialize: $s = 0$, $\alpha = \bar z$, $\boldsymbol{m}_d^s = \boldsymbol{m}_d^t$, $d = 1, \ldots, D$.

      ii. Update:
      \[
      \boldsymbol{m}_1^{s+1} = S_1 \Big( \boldsymbol{z} - \sum_{d \neq 1} \boldsymbol{m}_d^s \Big)
      \]
      \[
      \vdots
      \]
      \[
      \boldsymbol{m}_D^{s+1} = S_D \Big( \boldsymbol{z} - \sum_{d \neq D} \boldsymbol{m}_d^s \Big)
      \]
      and set $s = s + 1$.

      iii. Repeat step ii until the estimated functions do not change, and set $\boldsymbol{m}_d^{t+1} = \boldsymbol{m}_d^{s+1}$, $d = 1, \ldots, D$, and $t = t + 1$.

3. Repeat step 2 until the estimated functions do not change.

Figure 1: Local scoring algorithm (Hastie and Tibshirani (1990), p. 141).
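A compact sketch of the outer loop in Figure 1 follows (illustrative only). Here `fit_weighted_additive` stands in for the weighted backfitting step 2(b), for example the $D = 2$ sketch in Section 3, and the link-related functions are supplied by the user; all names are our own.

```python
import numpy as np

def local_scoring(y, fit_weighted_additive, g, g_inv, g_prime, V,
                  tol=1e-6, max_iter=50):
    """Outer (scoring) loop of Figure 1.

    fit_weighted_additive(z, w) must return the fitted additive predictor
    alpha + sum_d m_d(X_di) from a weighted additive model fit to z.
    g, g_inv, g_prime, V: link, inverse link, derivative of the link, and
    variance function of the assumed exponential family.
    """
    n = len(y)
    eta = np.full(n, g(np.mean(y)))              # t = 0: alpha = g(ybar), m_d = 0
    for _ in range(max_iter):
        mu = g_inv(eta)
        z = eta + (y - mu) * g_prime(mu)         # adjusted dependent variable z_i
        w = 1.0 / (g_prime(mu) ** 2 * V(mu))     # weights w_i = g'(mu)^(-2) V(mu)^(-1)
        eta_new = fit_weighted_additive(z, w)    # step 2(b): weighted backfitting
        if np.max(np.abs(eta_new - eta)) < tol:  # step 3: stop when the fit settles
            return eta_new
        eta = eta_new
    return eta

# Example (binary response, logit link):
# expit = lambda t: 1.0 / (1.0 + np.exp(-t))
# eta_hat = local_scoring(y, my_weighted_backfit,
#                         g=lambda m: np.log(m / (1 - m)), g_inv=expit,
#                         g_prime=lambda m: 1.0 / (m * (1 - m)),
#                         V=lambda m: m * (1 - m))
```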
