local scoring 1
TRANSCRIPT
-
7/27/2019 Local Scoring 1
1/20
A Note on Local Scoring and Weighted Local
Polynomial Regression in Generalized Additive
Models

J.D. Opsomer, Iowa State University
Goran Kauermann, University of Glasgow

November 14, 2002

Running title: Weighted Generalized Additive Models.
Iowa State University: Department of Statistics, Ames, IA 50011, USA; email: [email protected]
University of Glasgow: Department of Statistics & Robertson Centre, Boyd Orr Building, Glasgow G12 8QQ, UK.
Abstract
This article describes the asymptotic properties of local polynomial regression estimators for univariate and additive models when observation weights are included.
Such weighted additive models are a crucial component of local scoring, the widely
used estimation algorithm for generalized additive models described in Hastie and
Tibshirani (1990). The statistical properties of the univariate local polynomial es-
timator are shown to be asymptotically unaffected by the weights. In contrast, the
weights inflate the asymptotic variance of the additive model estimators. The impli-
cations of these findings for the local scoring estimators are discussed.
Key words: backfitting, additive model, GAM.
1 Introduction
Additive models and generalized additive models are popular multivariate nonpara-
metric regression techniques, widely used by statisticians and other scientists. While a number of different fitting methods are available for these models, the most popular
one is backfitting, an iterative algorithm proposed by Friedman and Stuetzle (1981).
For generalized additive models (GAM), the backfitting iteration is performed by
transforming the original observations to scores and fitting those in an iterative and
weighted manner using backfitting. The overall procedure is called local scoring
(Hastie and Tibshirani, 1990, p.140), which generalizes the Fisher scoring procedure
described in McCullagh and Nelder (1989, p.42). This procedure is implemented in
the gam() routine in Splus and frequently used in practice.
Backfitting and local scoring break the multivariate regression into a sequence
of univariate regressions, which are much easier to compute. Unlike in unrestricted
multi-dimensional smoothing, the resulting one-dimensional additive fits for each
of the covariates are easily displayed and interpreted. The ease of calculation and
interpretation has made these techniques widely used in statistical data exploration
and analysis. Recent applications of GAM fitting in the statistical literature include
Couper and Pepe (1997), Bio et al. (1998), Figueiras and Cadarso-Suárez (2001),
Fricker and Hengartner (2001), Rothery and Roy (2001), and a large and growing
number of researchers use GAM as an exploratory tool in their day-to-day statistical practice.
The study of the statistical properties of these estimators is complicated by their
implicit definition as the convergence point of an iterative algorithm. For unweighted
backfitting estimators, equivalent explicit definitions are available and can sometimes
be used to derive the properties of the estimators. When the univariate smoothing
methods used in backfitting are projections onto specific subspaces (for instance,
parametric regression, regression and smoothing splines), it is possible to write the
overall additive model estimator as a projection. The estimator can then be studied
using this equivalent formulation. See Stone (1985), Wahba (1986), Gu et al. (1989),
Härdle and Hall (1993), Ansley and Kohn (1994) and Mammen et al. (1999) for
results using this approach.
Alternatively, it is possible to rewrite the backfitting estimator as the solution
of a large system of linear equations. Opsomer and Ruppert (1997) use this alter-
native definition to derive asymptotic mean squared error properties of the estima-
tors for bivariate additive models using local polynomial regression, a widely used
non-projection smoother, and Opsomer (2000) generalizes these results to models of
higher dimension. In this paper, we study the asymptotic properties of the back-
fitting estimator for additive models in the presence of observation weights. These
weights can be included in a regression model to account for heteroskedasticity, or
they can reflect the sampling design used for collecting the data. As we will show,
observation weights have an effect on both the asymptotic bias and variance, so that
users of additive models should be aware of this effect.
Another reason for studying the effect of observation weights in the context of
additive models is that these weights form an integral part of the local scoring algo-
rithm itself. In local scoring for generalized additive models, the backfitting step is
performed repeatedly using observation weights, and these weights are updated at
each iteration of the algorithm. Hence, local scoring is essentially iterative fitting of
additive models using weighted observations. The weights depend on the mean value
of the response and are updated iteratively. An outline of the local scoring algorithm
is provided in Figure 1, which is further explained in Section 2.
As for unweighted backfitting, some results on local scoring are available when
projection-type smoothers are used. Stone (1986) and Burman (1990) study these
estimators in the context of regression splines. Gu et al. (1989) use smoothing
splines within a Bayesian framework to find an asymptotic posterior distribution for
the estimators. Recently, Aerts et al. (2002) describe some theoretical results for
generalized additive models fitted with penalized regression splines.
If non-projection smoothers such as kernel-based methods are used, it is necessary
to consider the sequence of weighted additive models fits explicitly. To our knowledge,
that has not yet been done. The results of the current paper are a step towards the
study of the local scoring algorithm, and can also be used by other authors interested
in that estimator. In Kauermann and Opsomer (2002), we study the properties of
a local likelihood estimator which is closely related to the local scoring estimator,
and we rely heavily on the weighted additive model results that are described in the current article.
A different approach to fitting GAMs has been explored by Linton (2000), who
replaces the backfitting component of local scoring by marginal integration, a non-
recursive method. While promising, the results from this research are not applicable
to the much more widely used local scoring estimators and will not be further dis-
cussed here. We refer however to Sperlich et al. (1999) for a comparison of backfitting
and integration estimators in additive models.
The outline of the paper is as follows. Section 2 introduces the statistical model
that is studied in this article and reviews the local scoring estimator. In Section
3, we derive the asymptotic results for weighted local polynomial regression and for
weighted additive models. Section 4 describes the implications of the previous section
for local scoring estimators.
2 The statistical model
In generalized additive models, the response variable $Y$ is assumed to have an exponential family distribution with conditional mean $\mu = E(Y \mid X_1, \ldots, X_D)$, which is linked to the predictors via
$$g(\mu) = \alpha + m_1(X_1) + \ldots + m_D(X_D), \qquad (1)$$
where $g(\cdot)$ is a known, invertible function. If $g(\cdot)$ is the identity link and the errors follow a continuous density, this reduces to an additive model. Just as the additive
model can be considered an extension of linear regression, GAM can be thought of as
a nonparametric equivalent of the generalized linear model of Nelder and Wedderburn
(1972). This class of models is quite broad and includes not only the additive model
itself but also nonparametric extensions to proportional-hazards, logit, log-linear regression and numerous other models.
The local scoring algorithm for fitting generalized additive models combines a
generalization of Fisher scoring, as commonly used for generalized linear models
(McCullagh and Nelder, 1989), with backfitting, an iterative procedure that reduces
the D-dimensional fitting problem to a sequence of 1-dimensional ones (Buja et al.
1989). Since it will be referred to frequently, we outline the local scoring algorithm
in Figure 1. In this figure, $\eta_i = g(\mu_i)$ represents the (unknown) additive function
of the predictors for the $i$th observation (see expression (1)), $V_i^t$ is an estimate of
$\mathrm{Var}(Y_i)$ at iteration $t$ (typically, $V_i^t$ is a known function of $\mu_i^t$), and $S_1, \ldots, S_D$ are
the smoother matrices corresponding to the nonparametric regression method used
for fitting the model (see Section 3).
In the next section, we begin by studying the situation where $g(\cdot)$ is the identity function and the weights are a smooth function of the covariates. After introducing
a weighted univariate smoother and deriving its basic properties, we will study the
weighted backfitting estimator.
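Since Figure 1 is not reproduced in this transcript, the shape of the local scoring iteration can be sketched in code. The following is our minimal illustration, not the paper's implementation: it uses a Bernoulli response with the logit link and a single smooth term, so the weighted additive-model step reduces to one weighted local linear smooth. The Epanechnikov kernel and the starting values $\mu_i^0 = (Y_i + \bar{Y})/2$ are our choices for the sketch.

```python
import numpy as np

def local_linear_smooth(X, z, w, h):
    """Weighted local linear smooth of z on X: observation weights w are
    multiplied with the Epanechnikov kernel weights (illustrative choice)."""
    fit = np.empty_like(z, dtype=float)
    for i, x0 in enumerate(X):
        u = (X - x0) / h
        kw = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0) * w
        Xx = np.column_stack([np.ones_like(X), X - x0])
        beta = np.linalg.solve(Xx.T @ (kw[:, None] * Xx), Xx.T @ (kw * z))
        fit[i] = beta[0]                 # intercept = fitted value at x0
    return fit

def local_scoring_logit(X, Y, h, n_iter=20):
    """Outer loop of local scoring for a Bernoulli GAM with logit link and one
    smooth term: scores z_i = eta_i + (Y_i - mu_i) g'(mu_i) with weights
    r_i = (g'(mu_i)^2 V(mu_i))^{-1}; for the logit link r_i = mu_i (1 - mu_i)."""
    mu = np.clip((Y + Y.mean()) / 2.0, 0.01, 0.99)   # starting values (our choice)
    eta = np.log(mu / (1.0 - mu))
    for _ in range(n_iter):
        z = eta + (Y - mu) / (mu * (1.0 - mu))       # adjusted dependent variable
        r = mu * (1.0 - mu)                           # observation weights
        eta = local_linear_smooth(X, z, r, h)         # weighted additive-model step
        mu = np.clip(1.0 / (1.0 + np.exp(-eta)), 1e-6, 1.0 - 1e-6)
    return eta, mu
```

With more than one covariate, the call to `local_linear_smooth` would be replaced by a full weighted backfitting pass, exactly as in step 2(b) of Figure 1.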
3 Weighted Additive Models
3.1 Weighted Local Polynomial Regression
In order to perform the weighted additive model step of the local scoring algorithm,
a weighted nonparametric smoother needs to be defined. We will focus here on local
polynomial regression, a popular smoothing technique, and define a weighted version
for use in generalized additive models. Hastie and Tibshirani (1990, pp. 72-74) discuss
approaches to include weights into the various nonparametric regression techniques.
For local polynomial regression, they propose to multiply the observation weights
and the kernel weights, and use them in the weighted least-squares fit. We will
follow that approach here. For simplicity, we will only provide asymptotic results
for the case when the degree of the local polynomial is odd. This covers local linear smoothing, the most commonly used kernel-based regression method.
Suppose that the data $(X_i, Y_i)$, $i = 1, \ldots, n$ are generated by the following model:
$$Y_i = m(X_i) + v(X_i)^{1/2}\varepsilon_i, \qquad (2)$$
where $m(\cdot)$ and $v(\cdot)$ are continuous, unknown functions over the support of $X_i$ and the $\varepsilon_i$ are independent and identically distributed random variables with mean 0 and variance 1. Let $X = (X_1, \ldots, X_n)^T$ and $Y = (Y_1, \ldots, Y_n)^T$. Let $K$ represent a kernel function and $h$ the corresponding bandwidth parameter, and let $r(X_i)$, $i = 1, \ldots, n$ represent a set of observation weights, assumed here to be a function of $X_i$. If $v$ were known, an obvious choice for the weight function would be $r(\cdot) = v(\cdot)^{-1}$, but we consider more general weight functions. The weighted local polynomial regression estimator of degree $p$ at a location $x$, written as $\hat{m}(x)$, is defined as the solution for $\hat\beta_0$ to the weighted mean squared error minimization centered at $x$,
$$\min_{\beta_0, \ldots, \beta_p} \sum_{i=1}^n K\left(\frac{X_i - x}{h}\right) r(X_i) \left(Y_i - \beta_0 - \beta_1(X_i - x) - \ldots - \beta_p(X_i - x)^p\right)^2.$$
To obtain a nonparametric estimator for the function $m(\cdot)$, this minimization is repeated for every $x$ at which a fit is needed. The solution to the minimization can be written down explicitly as
$$\hat{m}(x) = s_x^T Y = e_1^T (X_x^T R^{1/2} K_x R^{1/2} X_x)^{-1} X_x^T R^{1/2} K_x R^{1/2} Y, \qquad (3)$$
with $e_i$ a vector with a one in the $i$th position and zeros elsewhere, the matrix $K_x$
$= \mathrm{diag}\left\{\frac{1}{h}K\left(\frac{X_1 - x}{h}\right), \ldots, \frac{1}{h}K\left(\frac{X_n - x}{h}\right)\right\}$, $R = \mathrm{diag}\{r(X_1), \ldots, r(X_n)\}$, and
$$X_x = \begin{pmatrix} 1 & (X_1 - x) & \cdots & (X_1 - x)^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & (X_n - x) & \cdots & (X_n - x)^p \end{pmatrix}$$
(see also Ruppert and Wand, 1994). Since $R^{1/2}$ and $K_x$ are diagonal matrices,
$R^{1/2} K_x R^{1/2} = R K_x = K_x R$. We prefer the first notation because it generalizes
readily to non-diagonal weight matrices and simplifies the matrix algebra in the
proofs.
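As a concrete illustration, equation (3) can be implemented directly. The sketch below is ours, not from the paper; it assumes an Epanechnikov kernel and forms the dense diagonal matrices $K_x$ and $R^{1/2}$ purely for transparency (a practical implementation would avoid materializing them).

```python
import numpy as np

def m_hat(x, X, Y, r, h, p=1):
    """Weighted local polynomial estimate following equation (3):
    m_hat(x) = e1^T (Xx^T R^{1/2} Kx R^{1/2} Xx)^{-1} Xx^T R^{1/2} Kx R^{1/2} Y."""
    u = (X - x) / h
    Kx = np.diag(np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0) / h)  # kernel weights
    R12 = np.diag(np.sqrt(r))                                           # R^{1/2}
    Xx = np.vander(X - x, p + 1, increasing=True)   # columns 1, (Xi - x), ..., (Xi - x)^p
    W = R12 @ Kx @ R12                              # = R Kx, both matrices being diagonal
    return np.linalg.solve(Xx.T @ W @ Xx, Xx.T @ W @ Y)[0]  # e1^T picks the intercept
```

For example, on a noiseless quadratic target the local linear fit (`p=1`) at an interior point reproduces the function up to the $O(h^2)$ smoothing bias of Theorem 3.1.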
We introduce some additional notation and assumptions before stating the asymptotic bias and variance results for $\hat{m}(x)$. Let $f(x)$ represent the density of $X_i$. For the kernel function $K$, write the moments of $K$ as $\mu_j(K) = \int u^j K(u)\,du$ for any $j$ and let $R(K) = \int K(u)^2\,du$. We make the following technical assumptions:

(AS.I) The kernel $K$ is bounded and continuous and has compact support. Also, $\mu_{p+1}(K) \neq 0$.

(AS.II) The density $f$ is bounded, continuous and differentiable, has compact support and $f(x) > 0$ for all $x \in [a_x, b_x] = \mathrm{supp}(f)$.

(AS.III) The weight function $r$ is bounded, continuous and differentiable, and $r(x) > 0$ for all $x \in [a_x, b_x]$.

(AS.IV) The mean function $m$ is continuous and differentiable up to order $p + 1$ over $(a_x, b_x)$.

(AS.V) The variance function $v$ is continuous and $v(x) > 0$ for all $x \in [a_x, b_x]$.

(AS.VI) As $n \to \infty$, $h \to 0$ and $nh \to \infty$.

The following result is proven in the Appendix.
Theorem 3.1 For local polynomial fitting of degree $p$, for $p > 0$ and odd, the conditional bias and variance of $\hat{m}(x)$ for $x \in (a_x, b_x)$ can be approximated by:
$$E(\hat{m}(x) - m(x) \mid X) = \frac{1}{(p + 1)!} h^{p+1} \mu_{p+1}(K)\, m^{(p+1)}(x) + o_p(h^{p+1})$$
and
$$\mathrm{Var}(\hat{m}(x) \mid X) = \frac{1}{nh} R(K)\, v(x) f(x)^{-1} (1 + o_p(1)).$$
As discussed in Remark 1 of Ruppert and Wand (1994), the leading terms of the
bias and variance do not depend on X. Nevertheless, this is still a conditional
result, because in general the unconditional asymptotic bias and variance are not
guaranteed to exist.
Theorem 3.1 implies that, asymptotically, the inclusion of observation weights
that are a smooth function of the covariate has no effect on the bias and variance
of the local polynomial regression estimator. As we will show below, however, that
is no longer true when local polynomial regression is used within an additive model
context.
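This univariate claim can be checked numerically. The following Monte Carlo sketch is ours, not from the paper; it assumes an Epanechnikov kernel and the smooth weight function $r(x) = 1 + x$, and compares the sampling variance of the weighted and unweighted local linear estimators at an interior point.

```python
import numpy as np

def local_linear(x0, X, Y, r, h):
    """Local linear fit at x0 with combined kernel-times-observation weights."""
    u = (X - x0) / h
    w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0) * r  # Epanechnikov * r
    Xx = np.column_stack([np.ones_like(X), X - x0])
    return np.linalg.solve(Xx.T @ (w[:, None] * Xx), Xx.T @ (w * Y))[0]

def mc_variance(r_fun, n=400, h=0.15, reps=300, seed=0):
    """Monte Carlo variance of m_hat(0.5) under Y = sin(2*pi*X) + noise."""
    rng = np.random.default_rng(seed)
    est = np.empty(reps)
    for b in range(reps):
        X = rng.uniform(0, 1, n)
        Y = np.sin(2 * np.pi * X) + 0.5 * rng.standard_normal(n)
        est[b] = local_linear(0.5, X, Y, r_fun(X), h)
    return est.var()

v_unw = mc_variance(lambda X: np.ones_like(X))  # unweighted: r = 1
v_wgt = mc_variance(lambda X: 1.0 + X)          # smooth weights: r(x) = 1 + x
# Theorem 3.1: both variances share the leading term R(K) v(x) / (n h f(x)),
# so the ratio v_wgt / v_unw should be close to one.
```

In this simulation the two variances agree closely, consistent with the theorem's claim that smooth observation weights are asymptotically negligible in the univariate case.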
3.2 Additive Models Using Weighted Smoothers
We now consider data generated by the additive model
$$Y_i = \alpha + m_1(X_{1i}) + \ldots + m_D(X_{Di}) + v(X_{1i}, \ldots, X_{Di})^{1/2}\varepsilon_i,$$
where $v(\cdot)$ is a continuous, bounded function. We are interested in the heteroskedastic case since this is typically the situation encountered in the local scoring algorithm. Backfitting estimators for the additive model are usually defined as the solution of the backfitting algorithm at convergence. The backfitting algorithm is shown in step 2(b) in Figure 1.
An equivalent definition for these estimators is to view them as the solutions to the following set of estimating equations:
$$\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & & \ddots & \vdots \\ S_D & S_D & \cdots & I \end{pmatrix} \begin{pmatrix} m_1 \\ m_2 \\ \vdots \\ m_D \end{pmatrix} = \begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_D \end{pmatrix} Y, \qquad (4)$$
where $S_1, \ldots, S_D$ are $n \times n$ smoother matrices for the $D$ covariates. In the case of local polynomial regression, if $s_{d,x}$ represents the $n \times 1$ smoother vector that maps the vector of observations $Y$ to its nonparametric mean function estimate at a point $x$ for the $d$th covariate (as equation (3) did in the univariate case), then
$$S_d = \begin{pmatrix} s_{d,X_{d1}}^T \\ \vdots \\ s_{d,X_{dn}}^T \end{pmatrix}.$$
Equation (3) provides the explicit expression for the smoother vectors for weighted
local polynomial regression.
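The weighted smoother matrices and the backfitting solution of system (4) can be sketched as follows. This is our illustration, not the paper's gam() implementation: it assumes local linear smoothers with an Epanechnikov kernel and uses the centered smoothers $S_d^* = (I - \mathbf{1}\mathbf{1}^T/n)S_d$ that the paper introduces to make the system solvable.

```python
import numpy as np

def smoother_matrix(X, r, h):
    """n x n weighted local linear smoother matrix: row i is s_{d,X_di}^T,
    the smoother vector of equation (3) evaluated at x = X[i]."""
    n = len(X)
    S = np.empty((n, n))
    for i in range(n):
        u = (X - X[i]) / h
        w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0) * r
        Xx = np.column_stack([np.ones(n), X - X[i]])
        WX = w[:, None] * Xx
        S[i] = np.linalg.solve(Xx.T @ WX, WX.T)[0]   # e1^T (Xx^T W Xx)^{-1} Xx^T W
    return S

def backfit(Y, smoothers, tol=1e-10, max_iter=500):
    """Solve system (4) by backfitting, with centered smoothers S*_d."""
    n = len(Y)
    C = np.eye(n) - np.ones((n, n)) / n
    Sc = [C @ S for S in smoothers]                  # centered smoother matrices
    m = [np.zeros(n) for _ in smoothers]
    alpha = Y.mean()
    for _ in range(max_iter):
        delta = 0.0
        for d, Sd in enumerate(Sc):
            partial = Y - alpha - sum(m[k] for k in range(len(m)) if k != d)
            new = Sd @ partial                       # smooth the partial residuals
            delta = max(delta, np.abs(new - m[d]).max())
            m[d] = new
        if delta < tol:
            break
    return alpha, m
```

At convergence the iterates satisfy the estimating equations (4) exactly, which is how the fixed point can be compared against the direct matrix solution discussed next.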
Expression (4) represents a system of $nD$ equations in $nD$ unknowns and is solved through backfitting, but it is also possible, at least conceptually, to write the estimators directly as
$$\begin{pmatrix} \hat{m}_1 \\ \hat{m}_2 \\ \vdots \\ \hat{m}_D \end{pmatrix} = \begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & & \ddots & \vdots \\ S_D & S_D & \cdots & I \end{pmatrix}^{-1} \begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_D \end{pmatrix} Y \equiv M^{-1} C Y,$$
with the matrices $M$ and $C$ defined in this equation.

For most smoothing methods in practice, including local polynomial regression, the matrix $M$ as written here is not invertible. The smoothing matrices $S_d$ need to be replaced by the centered smoothing matrices $S_d^* = (I - \mathbf{1}\mathbf{1}^T/n) S_d$ for all $d = 1, \ldots, D$, where $\mathbf{1}$ is an $n \times 1$ vector of ones. The invertibility of the matrix $M^*$ composed of centered smoothers is discussed in Buja et al. (1989) and, for the case of unweighted local polynomial fitting, in Opsomer and Ruppert (1997) and Opsomer (2000).
For simplicity, we will focus here on the case where $D = 2$ and the local polynomials are of odd degree. The main results generalize to $D > 2$ using the recursive approach of Opsomer (2000), but the expressions become much more complicated. Buja et al. (1989) give explicit expressions for the $\hat{m}_d$ when $D = 2$:
$$\hat{m}_1 = \{I - (I - S_1^* S_2^*)^{-1}(I - S_1^*)\} Y \equiv W_1 Y$$
$$\hat{m}_2 = \{I - (I - S_2^* S_1^*)^{-1}(I - S_2^*)\} Y \equiv W_2 Y, \qquad (5)$$
provided the inverses exist. Since these direct expressions correspond to the solutions of the backfitting algorithm at convergence, it is possible to derive many of the properties of backfitting estimators from them. In particular, the convergence of the algorithm in Figure 1-2(b) and the uniqueness of the estimators (5) both follow directly from the existence of the inverse of $M^*$.
The results from Opsomer and Ruppert (1997) will be generalized to the situation in which a set of observation weights $r(X_{1i}, X_{2i})$, $i = 1, \ldots, n$ are used in the local polynomial regression. Let $r_1(x_1) = E(r(X_{1i}, X_{2i}) \mid X_{1i} = x_1)$ and $r_2(x_2) = E(r(X_{1i}, X_{2i}) \mid X_{2i} = x_2)$ denote the conditional univariate weight functions. The following assumptions are made:
(AS.I) The kernel $K$ is bounded and continuous and has compact support. Also, $\mu_{p_1+1}(K), \mu_{p_2+1}(K) \neq 0$.

(AS.II) The design densities $f$, $f_1$ and $f_2$ are bounded, continuous and differentiable, have compact support and $f(x_1, x_2) > 0$ for all $(x_1, x_2) \in \mathrm{supp}(f)$. The first derivatives of $f_1$ and $f_2$ have a finite number of sign changes over their support.

(AS.III) The weight function $r$ is bounded, continuous and differentiable, and $r(x_1, x_2) > 0$ for all $(x_1, x_2) \in \mathrm{supp}(f)$. For any fixed $x_1$, $\partial r(x_1, x_2)/\partial x_2$ has a finite number of sign changes over $\mathrm{supp}(f)$, and similarly with both variables interchanged. Also, the first derivatives of $r_1$ and $r_2$ have a finite number of sign changes over $\mathrm{supp}(f)$.

(AS.IV) The additive functions $m_1, m_2$ are continuous and differentiable up to order $p_1 + 1$, $p_2 + 1$, respectively, over $\mathrm{supp}(f)$.

(AS.V) The variance function $v$ is continuous and $v(x_1, x_2) > 0$ for all $(x_1, x_2) \in \mathrm{supp}(f)$.

(AS.VI) As $n \to \infty$, $h_1, h_2 \to 0$ and $nh_1/\log(n), nh_2/\log(n) \to \infty$.
Assumption (AS.III) on the conditional weight functions is rather technical in nature, but should be satisfied by any reasonable weight function used in the local scoring context. The remaining assumptions are the same as in Opsomer and Ruppert (1997). The following result generalizes Lemmas 3.1-3.2 of Opsomer and Ruppert (1997) and is proven in the Appendix.
Lemma 3.1 Under Assumptions (AS.I)-(AS.III), the following asymptotic approximations hold uniformly over all elements of the matrices:
$$S_1^* = S_1 - \mathbf{1}\mathbf{1}^T/n + o(\mathbf{1}\mathbf{1}^T/n) \quad \text{a.s.}$$
$$S_1^* S_2^* = T_{12} + o(\mathbf{1}\mathbf{1}^T/n) \quad \text{a.s.}$$
where $T_{12}$ is a matrix whose $ij$th element is
$$[T_{12}]_{ij} = \frac{1}{n} \frac{f(X_{1i}, X_{2j})}{f_1(X_{1i}) f_2(X_{2j})} \frac{r(X_{1i}, X_{2j})}{r_1(X_{1i}) r_2(X_{2j})}\, r(X_{1j}, X_{2j}) - \frac{1}{n}.$$
The approximation for $T_{12}$ simplifies to that given in Opsomer and Ruppert (1997), Lemma 3.1 under equal weighting. In Lemma 3.2, Opsomer and Ruppert (1997) provide sufficient conditions on the joint distribution of $X_{1i}$ and $X_{2i}$ to ensure that the additive model estimator is asymptotically unique. Because asymptotic uniqueness of the additive model estimators depends on the invertibility of $(I - T_{12})$, it follows directly from Lemma 3.1 that in the weighted case, both the distribution of the $X_i$ and the weight function will have an effect on the existence of the estimators through the spectral radius of $T_{12}$ (see Remark 3.1 in Opsomer and Ruppert (1997) for details). Developing sufficient conditions guaranteeing asymptotic uniqueness in the weighted case would therefore be very cumbersome and not very useful, since in practice they cannot be checked. We will therefore make an additional assumption guaranteeing invertibility:

(AS.VII) There exists a matrix norm $\|\cdot\|$ such that $\|T_{12}\| < 1$.

This assumption and the uniform convergence results in Lemma 3.1 are sufficient to prove that Lemma 3.2 in Opsomer and Ruppert (1997) holds for weighted additive models. In particular, this guarantees that the estimators exist for sufficiently large $n$ and that backfitting converges to a unique solution.
Additional notation is needed before stating the next result. Let $D^p$ represent the $p$th derivative operator, and let
$$D^p m = \left(\frac{d^p m(X_1)}{dx^p}, \ldots, \frac{d^p m(X_n)}{dx^p}\right)^T$$
for any function $m(\cdot)$. The main result of this section is stated in the following theorem, proven in the Appendix.
Theorem 3.2 Suppose that assumptions (AS.I)-(AS.VII) hold and that the local polynomials are of odd degree $p_1, p_2$. At the observation points $(X_{1i}, X_{2i})$, $i = 1, \ldots, n$, the conditional bias and variance of $\hat{m}_1(X_{1i})$ can be approximated by
$$E(\hat{m}_1(X_{1i}) - m_1(X_{1i}) \mid X_1, X_2) = \frac{1}{(p_1 + 1)!} h_1^{p_1+1} \mu_{p_1+1}(K) \left\{ e_i^T (I - T_{12})^{-1} D^{p_1+1} m_1 - E\left(m_1^{(p_1+1)}(X_1)\right) \right\}$$
$$- \frac{1}{(p_2 + 1)!} h_2^{p_2+1} \mu_{p_2+1}(K) \left\{ e_i^T (I - T_{12})^{-1} M_2 - E\left(m_2^{(p_2+1)}(X_2)\right) \right\} + O_p\left(\frac{1}{\sqrt{n}}\right) + o_p\left(h_1^{p_1+1} + h_2^{p_2+1}\right),$$
with
$$M_2 = \begin{pmatrix} E\left(m_2^{(p_2+1)}(X_2)\, r(X_1, X_2) \mid X_1 = X_{11}\right) / r_1(X_{11}) \\ \vdots \\ E\left(m_2^{(p_2+1)}(X_2)\, r(X_1, X_2) \mid X_1 = X_{1n}\right) / r_1(X_{1n}) \end{pmatrix},$$
and
$$\mathrm{Var}(\hat{m}_1(X_{1i}) \mid X_1, X_2) = \frac{1}{nh_1} \frac{R(K)}{f_1(X_{1i})} \frac{E\left(v(X_1, X_2)\, r(X_1, X_2)^2 \mid X_1 = X_{1i}\right)}{r_1(X_{1i})^2} + o_p\left(\frac{1}{nh_1}\right).$$
As in Opsomer and Ruppert (1997), it is also possible to derive the properties of the estimator for the additive mean function $E(Y \mid X_1, X_2) = \alpha + m_1(X_{1i}) + m_2(X_{2i})$. The bias of that estimator is the sum of the bias terms for $\hat{m}_1(X_{1i})$ and $\hat{m}_2(X_{2i})$, after removing the $O_p(1/\sqrt{n})$ terms, and its variance is the sum of their variances (see Opsomer and Ruppert (1997), Theorem 4.1 for details).

The theorem shows that the weighted additive model estimator is consistent and has a bias of order $O_p(h_1^{p_1+1} + h_2^{p_2+1})$, so that it has the same rates of convergence as one-dimensional smoothing. However, the bias expression in Theorem 3.2 is quite complicated in the weighted case, and it is not clear whether the bias is smaller or larger than in the unweighted case.
The use of the weight function $r$ also results in a variance of the same order as in the unweighted case, but with a more complicated leading term. If the errors have a constant variance $v \equiv \sigma^2$, then it is easy to see that the use of non-equal weights will increase the asymptotic variance, since by Jensen's inequality
$$\frac{E\left(r(X_1, X_2)^2 \mid X_1 = X_{1i}\right)}{r_1(X_{1i})^2} \geq 1$$
for any weight function $r$. This is in marked contrast to the result in Theorem 3.1 for univariate weighted local polynomial regression, where the effect of weighting is asymptotically negligible. If the variance function is known, however, then weighting might still be a good idea, as the following corollary makes more precise.
Corollary 3.1 Suppose that $r(X_1, X_2) = v(X_1, X_2)^{-1}$ for all $(X_1, X_2) \in \mathrm{supp}(f)$. Then, the variance of the additive model estimator is approximated by
$$\mathrm{Var}(\hat{m}_1(X_{1i}) \mid X_1, X_2) = \frac{1}{nh_1} \frac{R(K)}{f_1(X_{1i})}\, E\left(v(X_1, X_2)^{-1} \mid X_1 = X_{1i}\right)^{-1} + o_p\left(\frac{1}{nh_1}\right).$$
This corollary shows that, if it is possible to weight the observations by the inverse of their true variance function, then the effect of weighting on the asymptotic variance is negligible. Note that the asymptotic bias in Theorem 3.2 remains affected by the weights even in that case.
4 Local Scoring Estimators
We apply the results from Section 3 to the local scoring estimators for generalized additive models. At any iteration $t$ of the local scoring algorithm in Figure 1, the weighted additive model provides consistent estimators for the additive model
$$E(z_i) = \alpha + m_1(X_{1i}) + m_2(X_{2i})$$
with weights
$$r(X_{1i}, X_{2i}) = \left(g'(\mu_i)^2 V(\mu_i)\right)^{-1},$$
where $\mu_i = g^{-1}(\alpha + m_1(X_{1i}) + m_2(X_{2i}))$ as calculated in the previous iteration $t - 1$.

In general, it will not be possible to check that all the conditions on $r$ stated in (AS.III) hold for every $r$. If $g$ and $V$ are known functions of $\mu = E(Y \mid X_1, X_2)$, it is possible to check whether the conditions hold for $r(X_1, X_2) = (g'(\mu)^2 V(\mu))^{-1}$. For the right choice of bandwidth and reasonably well-behaved data, however, it is reasonable to assume that $r(X_1, X_2)$ will then also be a continuous, differentiable function over $\mathrm{supp}(f)$ with the technical smoothness properties mentioned in (AS.III).
The local scoring algorithm of Hastie and Tibshirani (1990) is difficult to analyze
for general linear smoothers. For generalized additive models fitted using local poly-
nomial regression, local scoring does not explicitly solve a set of score equations as in
the generalized linear model case (e.g. McCullagh and Nelder (1989), p.41). There-
fore, there is no explicit expression for what the Newton-Raphson steps converge to.
This was one of the main reasons for proposing the local likelihood estimation for
GAM in Kauermann and Opsomer (2002) as an easier-to-study alternative to local
scoring. Nevertheless, we can look at two simple cases to gain some insight in the
behavior of the local scoring algorithm.
First, it is easy to show that if the model has an identity link ($\eta_i = \mu_i$), the local scoring algorithm in Figure 1, using the starting values proposed by Hastie and Tibshirani (1990), is equivalent to a weighted additive model. Hence, in this case only one iteration of the outer loop in Figure 1 is performed and the asymptotic properties of the estimators are given in Theorem 3.2.
Second, consider the hypothetical one-step estimate where the true values of the unknown quantities are used as starting values. This corresponds to the approach used when an iterative algorithm solves explicit equations, since in that case the one-step estimate using the true values is asymptotically equivalent to the fully iterated solution (see Serfling, 1980, p.258). This will be the case if we can assume that the local scoring estimator is consistent, for instance. Aerts et al. (2002) use this approach for generalized additive model estimation with penalized regression splines. At the true values, we have $\mu_i = g^{-1}(\alpha + m_1(X_{1i}) + m_2(X_{2i}))$ and
$$z_i = \alpha + m_1(X_{1i}) + m_2(X_{2i}) + (Y_i - \mu_i)\, g'(\mu_i).$$
Hence, $E(z_i) = \alpha + m_1(X_{1i}) + m_2(X_{2i})$ and $\mathrm{Var}(z_i) = g'(\mu_i)^2 V(\mu_i) = r(X_{1i}, X_{2i})^{-1}$, so that Theorem 3.2 and Corollary 3.1 again apply directly to this case. If the weight function is not correctly specified, only Theorem 3.2 applies.
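The mean and variance claims for $z_i$ are easy to check by simulation. The following sketch is ours, using a Poisson response with the log link, for which $g'(\mu) = 1/\mu$ and $V(\mu) = \mu$, so that $\mathrm{Var}(z_i) = g'(\mu_i)^2 V(\mu_i) = 1/\mu_i = r^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 3.0                         # true mean; log link, so eta = log(mu)
eta = np.log(mu)
Y = rng.poisson(mu, 200_000)
z = eta + (Y - mu) / mu          # adjusted dependent variable: g'(mu) = 1/mu
# E(z) = eta and Var(z) = g'(mu)^2 V(mu) = 1/mu, the inverse of the weight r.
print(z.mean(), z.var())         # close to log(3) and 1/3, respectively
```

The sample mean and variance of the scores match $\eta$ and $r^{-1}$ up to Monte Carlo error, which is the fact exploited by the one-step argument above.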
This implies that if we have starting values for $\mu_i$ that are close to the true values, the Hastie and Tibshirani (1990) local scoring algorithm should provide estimators with desirable statistical properties similar to those of additive models, including asymptotic unbiasedness and one-dimensional nonparametric regression convergence rates. These results generalize to the $D$-dimensional case, as done for additive models in Opsomer (2000).
5 Conclusions
In this article, we have described the asymptotic properties of additive models and
generalized additive models in the presence of observation weights. We have shown
that, unlike in univariate nonparametric regression, observation weights can poten-
tially inflate the variance and modify the bias in the additive model. The effect on
the asymptotic variance can be avoided, but only if the weights correspond to the
inverse of the variance of the model errors. Hence, if the weights are not variance-
related, for instance when the weighting comes from sampling design considerations,
the resulting estimator will have a larger variance then the unweighted estimator
(this is analogous to what happens with weighted least squares (WLS) estimators
in parametric linear regression). Overall, while the weights indeed affect the leading
terms of both the asymptotic bias and variance, they do not change the convergence
rates of the estimators.
We have discussed some of the implications of these findings for the widely used
local scoring estimators. In particular, if the model (and its variance) is well-specified
and the local scoring estimators are consistent, the effect of the weights is asymp-
totically negligible. Because of the iterative nature of the estimator, it is difficult to
prove this rigorously, however.
One important implication of this article is that the effect of the weights cannot be
ignored in additive and generalized additive models. In addition to exploring some of
the consequences of the weights in this article, the results proven here will be helpful
for researchers working on these models. For instance, Kauermann and Opsomer
(2002) have used the results on weighted additive models in deriving the asymptotic
properties of local likelihood estimators for generalized additive models.
A Proofs
Proof of Theorem 3.1: To simplify notation, we will prove the theorem for the case
p = 1. The method of proof is entirely analogous to that of Ruppert and Wand
(1994), Theorem 2.1, and it can be generalized to arbitrary p following the approach
in their Theorem 4.1.
First, note that the bias can be written as
$$E(\hat{m}(x) - m(x) \mid X) = \frac{1}{2} e_1^T (X_x^T R^{1/2} K_x R^{1/2} X_x)^{-1} X_x^T R^{1/2} K_x R^{1/2} (Q_m(x) + B_m(x)),$$
with $Q_m(x) = ((X_1 - x)^2, \ldots, (X_n - x)^2)^T m''(x)$ and $B_m(x)$ a vector of Taylor series remainder terms. The latter is of smaller order than the terms in $Q_m(x)$, provided $m''(x)$ is non-zero. Now,
$$\left(\frac{1}{n} X_x^T R^{1/2} K_x R^{1/2} X_x\right)^{-1} = \begin{pmatrix} (r(x)f(x))^{-1} + o_p(1) & -D(rf)(x)(r(x)f(x))^{-2} + o_p(1) \\ -D(rf)(x)(r(x)f(x))^{-2} + o_p(1) & (\mu_2(K) r(x) f(x) h^2)^{-1} + o_p(h^{-2}) \end{pmatrix}$$
and
$$\frac{1}{n} X_x^T R^{1/2} K_x R^{1/2} Q_m(x) = \begin{pmatrix} h^2 \mu_2(K) r(x) f(x) + o_p(h^2) \\ h^4 \mu_4(K) D(rf)(x) + o_p(h^4) \end{pmatrix} m''(x),$$
so that
$$E(\hat{m}(x) - m(x) \mid X) = \frac{1}{2} h^2 \mu_2(K) m''(x) + o_p(h^2).$$
For the variance, we need to approximate
$$\frac{1}{n} X_x^T R^{1/2} K_x R^{1/2} V R^{1/2} K_x R^{1/2} X_x = \frac{1}{h}\, r(x)^2 f(x) v(x) \begin{pmatrix} R(K) & h \mu_1(K^2) \\ h \mu_1(K^2) & h^2 \mu_2(K^2) \end{pmatrix} (1 + o_p(1)),$$
so that
$$\mathrm{Var}(\hat{m}(x) \mid X) = \frac{1}{nh} R(K)\, v(x) f(x)^{-1} (1 + o_p(1)).$$
Proof of Lemma 3.1: The proof follows the same approach as that of Opsomer and Ruppert (1997), Lemma 3.1. We begin by showing that the statements in the lemma hold in probability, using the approximation
$$[S_1]_{ij} = \frac{1}{nh_1}\, r_1(X_{1i})^{-1} f_1(X_{1i})^{-1} K\left(\frac{X_{1j} - X_{1i}}{h_1}\right) r(X_{1j}, X_{2j}) (1 + o_p(1)).$$
For the first statement, the reasoning is completely analogous to that in Opsomer and Ruppert (1997), Lemma 3.1. For the second statement,
$$[S_1 S_2]_{ij} = \frac{1}{n^2 h_1 h_2} \frac{r(X_{1j}, X_{2j})}{r_1(X_{1i}) f_1(X_{1i})} (1 + o_p(1)) \sum_{k=1}^n \frac{r(X_{1k}, X_{2k})}{r_2(X_{2k}) f_2(X_{2k})} K\left(\frac{X_{1k} - X_{1i}}{h_1}\right) K\left(\frac{X_{2k} - X_{2j}}{h_2}\right)$$
$$= \frac{1}{n} \frac{f(X_{1i}, X_{2j})}{f_1(X_{1i}) f_2(X_{2j})} \frac{r(X_{1i}, X_{2j})}{r_1(X_{1i}) r_2(X_{2j})}\, r(X_{1j}, X_{2j}) (1 + o_p(1)).$$
The second statement in the lemma then holds in probability, from this approximation and the first statement.

Because of the additional assumptions on the weight function in (AS.III), the same approach as in the proof of Opsomer and Ruppert (1997), Lemma 3.1 can be followed to prove the uniform convergence.
Proof of Theorem 3.2: The proof follows the approach used in proving Theorem 4.1 in Opsomer and Ruppert (1997), and for simplicity, we consider only the case $p_1 = p_2 = 1$. We let $Q_1 = (s_{1,X_{11}}^T Q_{m_1}(X_{11}), \ldots, s_{1,X_{1n}}^T Q_{m_1}(X_{1n}))^T$ and $Q_1^* = (I - \mathbf{1}\mathbf{1}^T/n) Q_1$, with analogous definitions holding for $Q_2$ and $Q_2^*$. It follows directly from Theorem 3.1 and equation (5) that
$$E(\hat{m}_1 - m_1) = \frac{1}{2} (I - S_1^* S_2^*)^{-1} (Q_1^* - S_1^* Q_2^*) + O_p\left(\frac{1}{\sqrt{n}}\right) + o_p(h_1^2 + h_2^2)$$
with the $O_p(1/\sqrt{n})$ term due to the presence of the model intercept. $Q_1^*$ can be approximated as in the proof of Theorem 3.1 and shown to be asymptotically unaffected by the presence of the weights. Unlike in the unweighted case, however, a similar calculation for $S_1^* Q_2^*$ will involve terms of the form
$$s_{1,x_1}^T D^2 m_2 = \frac{E(m_2''(X_2)\, r(X_1, X_2) \mid X_1 = x_1)}{r_1(x_1)} + o_p(1),$$
so that
$$S_1 Q_2 = \mu_2(K) h_2^2 M_2 + o_p(h_2^2).$$
Hence, the conditional bias of $\hat{m}_1$ can be approximated by
$$E(\hat{m}_1 - m_1 \mid X_1, X_2) = \frac{1}{2} h_1^2 \mu_2(K) \left\{ (I - T_{12})^{-1} D^2 m_1 - E(m_1''(X_1)) \mathbf{1} \right\} - \frac{1}{2} h_2^2 \mu_2(K) \left\{ (I - T_{12})^{-1} M_2 - E(m_2''(X_2)) \mathbf{1} \right\} + O_p\left(\frac{1}{\sqrt{n}}\right) + o_p(h_1^2 + h_2^2).$$
For the variance approximation, we start from the exact variance
$$\mathrm{Var}(\hat{m}_1(X_{1i}) \mid X_1, X_2) = e_i^T W_1 V W_1^T e_i$$
with $V = \mathrm{diag}\{v(X_{11}, X_{21}), \ldots, v(X_{1n}, X_{2n})\}$. Following the approach in the proof of Theorem 4.1 in Opsomer and Ruppert (1997), the leading variance term can be shown to be $e_i^T S_1 V S_1^T e_i$. Using the approach from the proof of Theorem 3.1 again to approximate this term, we find
$$\mathrm{Var}(\hat{m}_1(X_{1i}) \mid X_1, X_2) = \frac{1}{nh_1} \frac{R(K)}{f_1(X_{1i})} \frac{E\left(v(X_1, X_2)\, r(X_1, X_2)^2 \mid X_1 = X_{1i}\right)}{r_1(X_{1i})^2} + o_p\left(\frac{1}{nh_1}\right).$$
References
Aerts, M., G. Claeskens, and M. Wand (2002). Some theory for penalized spline generalized additive models. Journal of Statistical Planning and Inference 103, 455–470.

Ansley, C. F. and R. Kohn (1994). Convergence of the backfitting algorithm for additive models. Journal of the Australian Mathematical Society (Series A) 57, 316–329.
Bio, A., R. Alkemade, and A. Barendregt (1998). Determining alternative models for vegetation response analysis: a non-parametric approach. Journal of Vegetation Science 9, 5–16.

Buja, A., T. J. Hastie, and R. J. Tibshirani (1989). Linear smoothers and additive models. Annals of Statistics 17, 453–555.

Burman, P. (1990). Estimation of generalized additive models. Journal of Multivariate Analysis 32, 230–255.

Couper, D. and M. S. Pepe (1997). Modelling prevalence of a condition: Chronic graft-versus-host disease after bone marrow transplantation. Statistics in Medicine 16, 1551–1571.

Figueiras, A. and C. Cadarso-Suárez (2001). Application of nonparametric models for calculating odds ratios and their confidence intervals for continuous exposures. American Journal of Epidemiology 154(3), 264–275.

Fricker, R. D., Jr. and N. W. Hengartner (2001). Environmental equity and the distribution of toxic release inventory and other environmentally undesirable sites in metropolitan New York City. Environmental and Ecological Statistics 8(1), 33–52.

Friedman, J. H. and W. Stuetzle (1981). Projection pursuit regression. Journal of the American Statistical Association 76, 817–823.

Gu, C., D. M. Bates, Z. Chen, and G. Wahba (1989). The computation of GCV functions through Householder tridiagonalization with application to the fitting of interaction spline models. SIAM Journal of Matrix Analysis Applications 10, 457–480.

Härdle, W. and P. Hall (1993). On the backfitting algorithm for additive regression models. Statistica Neerlandica 47, 43–57.

Hastie, T. J. and R. J. Tibshirani (1990). Generalized Additive Models. Washington, D.C.: Chapman and Hall.

Kauermann, G. and J. D. Opsomer (2002). Local likelihood estimation in generalized additive models. To appear in Scandinavian Journal of Statistics.

Linton, O. B. (2000). Efficient estimation of generalized additive nonparametric regression models. Econometric Theory 16, 502–523.
Mammen, E., O. Linton, and J. Nielsen (1999). The existence and asymptotic
properties of a backfitting projection algorithm under weak conditions. Annals
of Statistics 27, 1443–1490.
McCullagh, P. and J. A. Nelder (1989). Generalized Linear Models (2nd ed.). London:
Chapman and Hall.
Nelder, J. A. and R. W. M. Wedderburn (1972). Generalized linear models. Journal
of the Royal Statistical Society, Series A 135, 370–384.
Opsomer, J. D. (2000). Asymptotic properties of backfitting estimators. Journal
of Multivariate Analysis 73, 166–179.
Opsomer, J. D. and D. Ruppert (1997). Fitting a bivariate additive model by local
polynomial regression. Annals of Statistics 25, 186–211.
Rothery, P. and D. B. Roy (2001). Application of generalized additive models to
butterfly transect count data. Journal of Applied Statistics 28(7), 897–909.
Ruppert, D. and M. P. Wand (1994). Multivariate locally weighted least squares
regression. Annals of Statistics 22, 1346–1370.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. New
York: John Wiley & Sons.
Sperlich, S., O. Linton, and W. Härdle (1999). Integration and backfitting methods
in additive models: finite sample properties and comparison. Test 8, 419–459.
Stone, C. J. (1985). Additive regression and other nonparametric models. Annals
of Statistics 13, 689–705.
Stone, C. J. (1986). The dimensionality reduction principle for generalized additive
models. Annals of Statistics 14, 590–606.
Wahba, G. (1986). Partial and interaction splines for the semiparametric esti-
mation of functions of several variables. In T. J. Boardman (Ed.), Computer
Science and Statistics: Proceedings of the 18th Symposium on the Interface,
pp. 75–80. American Statistical Association.
1. Initialize: $t = 0$, $\hat\alpha = g(\bar y)$, $\hat m_1^t = \ldots = \hat m_D^t = 0$, with
$\hat m_d^t = (\hat m_d^t(X_{d1}), \ldots, \hat m_d^t(X_{dn}))^T$.
2. Update:
(a) Transformation/reweighting: construct an adjusted dependent variable
$$ z_i = \eta_i^t + (y_i - \mu_i^t)\Bigl(\frac{\partial \eta_i}{\partial \mu_i}\Bigr)_t, \quad i = 1, \ldots, n $$
with $\eta_i^t = \hat\alpha + \sum_{d=1}^D \hat m_d^t(X_{di})$ and $\mu_i^t = g^{-1}(\eta_i^t)$, and the weights
$$ w_i = \Bigl(\frac{\partial \mu_i}{\partial \eta_i}\Bigr)^2_t (V_i^t)^{-1}, \quad i = 1, \ldots, n. $$
(b) Backfitting: fit the weighted additive model to $z = (z_1, \ldots, z_n)^T$ to obtain
estimated functions $\hat m_d^{t+1}(\cdot)$ through backfitting
i. Initialize: $s = 0$, $\hat\alpha = \bar z$, $\hat m_d^s = \hat m_d^t$, $d = 1, \ldots, D$
ii. Update:
$$ \hat m_1^{s+1} = S_1\Bigl(z - \hat\alpha\mathbf{1} - \sum_{d \neq 1} \hat m_d^s\Bigr) $$
$$ \vdots $$
$$ \hat m_D^{s+1} = S_D\Bigl(z - \hat\alpha\mathbf{1} - \sum_{d \neq D} \hat m_d^s\Bigr) $$
and set $s = s + 1$.
iii. Repeat step ii until the estimated functions do not change, and set
$\hat m_d^{t+1} = \hat m_d^{s+1}$, $d = 1, \ldots, D$, and $t = t + 1$.
3. Repeat step 2 until the estimated functions do not change.
Figure 1: Local scoring algorithm (Hastie and Tibshirani (1990), p. 141).
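As a concrete illustration of the algorithm in Figure 1, the following is a minimal sketch of local scoring for a logistic additive model fitted with weighted local linear smoothers. It assumes a Bernoulli response with the logit link, so that $\partial\eta/\partial\mu = 1/(\mu(1-\mu))$ and $V = \mu(1-\mu)$; all function names, bandwidths, and iteration limits are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def weighted_ll_smoother(x, w, h):
    """n x n smoother matrix for a local linear fit at the observation
    points x, with observation weights w and Gaussian kernel bandwidth h."""
    n = len(x)
    S = np.zeros((n, n))
    for i in range(n):
        d = x - x[i]
        k = w * np.exp(-0.5 * (d / h) ** 2)   # kernel times observation weights
        X = np.column_stack([np.ones(n), d])
        S[i] = np.linalg.solve((X.T * k) @ X, X.T * k)[0]
    return S

def local_scoring_logit(y, X, h, n_outer=15, n_inner=100, tol=1e-6):
    """Local scoring (steps 1-3 of Figure 1) for a logistic additive model."""
    n, D = X.shape
    m = np.zeros((D, n))                        # step 1: m_d^t = 0
    alpha = np.log(y.mean() / (1 - y.mean()))   # alpha = g(ybar), logit link
    for _ in range(n_outer):                    # step 3: repeat step 2
        eta = alpha + m.sum(axis=0)
        mu = np.clip(1 / (1 + np.exp(-eta)), 1e-6, 1 - 1e-6)
        z = eta + (y - mu) / (mu * (1 - mu))    # step 2(a): adjusted variable
        w = mu * (1 - mu)                       # (dmu/deta)^2 / V(mu)
        S = [weighted_ll_smoother(X[:, d], w, h) for d in range(D)]
        alpha = np.average(z, weights=w)
        for _ in range(n_inner):                # step 2(b): backfitting
            m_old = m.copy()
            for d in range(D):
                partial = z - alpha - (m.sum(axis=0) - m[d])
                m[d] = S[d] @ partial
                m[d] -= np.average(m[d], weights=w)  # center for identifiability
            if np.abs(m - m_old).max() < tol:
                break
    return alpha, m

# Simulated example: two additive components on the logit scale.
rng = np.random.default_rng(1)
n = 200
X = rng.uniform(size=(n, 2))
eta_true = np.sin(2 * np.pi * X[:, 0]) + 2 * (X[:, 1] - 0.5)
y = rng.binomial(1, 1 / (1 + np.exp(-eta_true))).astype(float)
alpha_hat, m_hat = local_scoring_logit(y, X, h=0.15)
```

The inner loop is the weighted backfitting whose asymptotic behavior the paper analyzes: the observation weights $w_i$ enter every smoother matrix, which is what inflates the additive-component variances relative to the unweighted case.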