

A Note on Local Scoring and Weighted Local Polynomial Regression in Generalized Additive Models^1

J.D. Opsomer, Iowa State University^2

Goran Kauermann, University of Glasgow^3

November 14, 2002

^1 Running title: Weighted Generalized Additive Models.
^2 Department of Statistics, Ames, IA 50011, USA; email: [email protected]
^3 Department of Statistics & Robertson Centre, Boyd Orr Building, Glasgow G12 8QQ, UK.


    Abstract

This article describes the asymptotic properties of local polynomial regression estimators for univariate and additive models when observation weights are included. Such weighted additive models are a crucial component of local scoring, the widely used estimation algorithm for generalized additive models described in Hastie and Tibshirani (1990). The statistical properties of the univariate local polynomial estimator are shown to be asymptotically unaffected by the weights. In contrast, the weights inflate the asymptotic variance of the additive model estimators. The implications of these findings for the local scoring estimators are discussed.

    Key words: backfitting, additive model, GAM.


    1 Introduction

Additive models and generalized additive models are popular multivariate nonparametric regression techniques, widely used by statisticians and other scientists. While a number of different fitting methods are available for these models, the most popular

    one is backfitting, an iterative algorithm proposed by Friedman and Stuetzle (1981).

    For generalized additive models (GAM), the backfitting iteration is performed by

    transforming the original observations to scores and fitting those in an iterative and

    weighted manner using backfitting. The overall procedure is called local scoring

(Hastie and Tibshirani, 1990, p. 140), which generalizes the Fisher scoring procedure described in McCullagh and Nelder (1989, p. 42). This procedure is implemented in the gam() routine in Splus and is frequently used in practice.

    Backfitting and local scoring break the multivariate regression into a sequence

    of univariate regressions, which are much easier to compute. Unlike in unrestricted

    multi-dimensional smoothing, the resulting one-dimensional additive fits for each

    of the covariates are easily displayed and interpreted. The ease of calculation and

    interpretation has made these techniques widely used in statistical data exploration

    and analysis. Recent applications of GAM fitting in the statistical literature include

Couper and Pepe (1997), Bio et al. (1998), Figueiras and Cadarso-Suárez (2001),

    Fricker and Hengartner (2001), Rothery and Roy (2001), and a large and growing

number of researchers use GAM as an exploratory tool in their day-to-day statistical practice.

    The study of the statistical properties of these estimators is complicated by their

    implicit definition as the convergence point of an iterative algorithm. For unweighted

    backfitting estimators, equivalent explicit definitions are available and can sometimes

    be used to derive the properties of the estimators. When the univariate smoothing

    methods used in backfitting are projections onto specific subspaces (for instance,

    parametric regression, regression and smoothing splines), it is possible to write the

    overall additive model estimator as a projection. The estimator can then be studied

    using this equivalent formulation. See Stone (1985), Wahba (1986), Gu et al. (1989),

    Hardle and Hall (1993), Ansley and Kohn (1994) and Mammen et al. (1999) for

    results using this approach.

    Alternatively, it is possible to rewrite the backfitting estimator as the solution

of a large system of linear equations. Opsomer and Ruppert (1997) use this alternative definition to derive asymptotic mean squared error properties of the estimators


for bivariate additive models using local polynomial regression, a widely used

    non-projection smoother, and Opsomer (2000) generalizes these results to models of

higher dimension. In this paper, we study the asymptotic properties of the backfitting estimator for additive models in the presence of observation weights. These

    weights can be included in a regression model to account for heteroskedasticity, or

    they can reflect the sampling design used for collecting the data. As we will show,

    observation weights have an effect on both the asymptotic bias and variance, so that

    users of additive models should be aware of it.

    Another reason for studying the effect of observation weights in the context of

additive models is that these weights form an integral part of the local scoring algorithm itself. In local scoring for generalized additive models, the backfitting step is performed repeatedly using observation weights, and these weights are updated at each iteration of the algorithm. Hence, local scoring is essentially iterative fitting of

    additive models using weighted observations. The weights depend on the mean value

    of the response and are updated iteratively. An outline of the local scoring algorithm

    is provided in Figure 1, which is further explained in Section 2.

    As for unweighted backfitting, some results on local scoring are available when

    projection-type smoothers are used. Stone (1986) and Burman (1990) study these

    estimators in the context of regression splines. Gu et al. (1989) use smoothing

    splines within a Bayesian framework to find an asymptotic posterior distribution for

    the estimators. Recently, Aerts et al. (2002) describe some theoretical results for

    generalized additive models fitted with penalized regression splines.

    If non-projection smoothers such as kernel-based methods are used, it is necessary

to consider the sequence of weighted additive model fits explicitly. To our knowledge,

    that has not yet been done. The results of the current paper are a step towards the

    study of the local scoring algorithm, and can also be used by other authors interested

    in that estimator. In Kauermann and Opsomer (2002), we study the properties of

    a local likelihood estimator which is closely related to the local scoring estimator,

and we rely heavily on the weighted additive model results that are described in the current article.

    A different approach to fitting GAMs has been explored by Linton (2000), who

    replaces the backfitting component of local scoring by marginal integration, a non-

    recursive method. While promising, the results from this research are not applicable

to the much more widely used local scoring estimators and will not be further discussed here. We refer, however, to Sperlich et al. (1999) for a comparison of backfitting


    and integration estimators in additive models.

    The outline of the paper is as follows. Section 2 introduces the statistical model

    that is studied in this article and reviews the local scoring estimator. In Section

    3, we derive the asymptotic results for weighted local polynomial regression and for

    weighted additive models. Section 4 describes the implications of the previous section

    for local scoring estimators.

    2 The statistical model

In generalized additive models, the response variable $Y$ is assumed to have an exponential family distribution with conditional mean $\mu = E(Y \mid X_1, \ldots, X_D)$, which is linked to the predictors via
\[
g(\mu) = \alpha + m_1(X_1) + \ldots + m_D(X_D), \tag{1}
\]
where $g(\cdot)$ is a known, invertible function. If $g(\cdot)$ is the identity link and the errors follow a continuous density, this reduces to an additive model. Just as the additive

    model can be considered an extension of linear regression, GAM can be thought of as

a nonparametric equivalent of the generalized linear model of Nelder and Wedderburn (1972). This class of models is quite broad and includes not only the additive model itself but also nonparametric extensions to proportional-hazards, logit, log-linear regression and numerous other models.
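One illustrative special case (not one singled out in the text): for a binary response with the logit link, model (1) becomes the nonparametric analogue of logistic regression,
\[
\log\frac{\mu}{1-\mu} = \alpha + m_1(X_1) + \ldots + m_D(X_D), \qquad \mu = E(Y \mid X_1, \ldots, X_D).
\]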

    The local scoring algorithm for fitting generalized additive models combines a

    generalization of Fisher scoring, as commonly used for generalized linear models

    (McCullagh and Nelder, 1989), with backfitting, an iterative procedure that reduces

the D-dimensional fitting problem to a sequence of 1-dimensional ones (Buja et al., 1989). Since it will be referred to frequently, we outline the local scoring algorithm

in Figure 1. In this figure, $\eta_i = g(\mu_i)$ represents the (unknown) additive function of the predictors for the $i$th observation (see expression (1)), $V_i^t$ is an estimate of $\mathrm{Var}(Y_i)$ at iteration $t$ (typically, $V_i^t$ is a known function of $\mu_i^t$), and $S_1, \ldots, S_D$ are

    the smoother matrices corresponding to the nonparametric regression method used

    for fitting the model (see Section 3).

In the next section, we begin by studying the situation where $g(\cdot)$ is the identity function and the weights are a smooth function of the covariates. After introducing

    a weighted univariate smoother and deriving its basic properties, we will study the

    weighted backfitting estimator.


    3 Weighted Additive Models

    3.1 Weighted Local Polynomial Regression

    In order to perform the weighted additive model step of the local scoring algorithm,

    a weighted nonparametric smoother needs to be defined. We will focus here on local

    polynomial regression, a popular smoothing technique, and define a weighted version

for use in generalized additive models. Hastie and Tibshirani (1990, pp. 72–74) discuss

    approaches to include weights into the various nonparametric regression techniques.

    For local polynomial regression, they propose to multiply the observation weights

    and the kernel weights, and use them in the weighted least-squares fit. We will

    follow that approach here. For simplicity, we will only provide asymptotic results

for the case when the degree of the local polynomial is odd. This covers local linear smoothing, the most commonly used kernel-based regression method.

Suppose that the data $(X_i, Y_i)$, $i = 1, \ldots, n$ is generated by the following model:
\[
Y_i = m(X_i) + v(X_i)^{1/2} \varepsilon_i, \tag{2}
\]
where $m(\cdot)$ and $v(\cdot)$ are continuous, unknown functions over the support of $X_i$ and the $\varepsilon_i$ are independent and identically distributed random variables with mean 0 and variance 1. Let $\boldsymbol{X} = (X_1, \ldots, X_n)^T$ and $\boldsymbol{Y} = (Y_1, \ldots, Y_n)^T$. Let $K$ represent a kernel function and $h$ the corresponding bandwidth parameter, and let $r(X_i)$, $i = 1, \ldots, n$ represent a set of observation weights, assumed here to be a function of $X_i$. If $v$ were known, an obvious choice for the weight function would be $r(\cdot) = v(\cdot)^{-1}$, but we consider more general weight functions. The weighted local polynomial regression estimator of degree $p$ at a location $x$, written as $\hat m(x)$, is defined as the solution for $\beta_0$ to the weighted mean squared error minimization centered at $x$,
\[
\min_{\beta_0, \ldots, \beta_p} \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right) r(X_i) \left( Y_i - \beta_0 - \beta_1 (X_i - x) - \ldots - \beta_p (X_i - x)^p \right)^2 .
\]
To obtain a nonparametric estimator for the function $m(\cdot)$, this minimization is repeated for every $x$ at which a fit is needed. The solution to the minimization can be written down explicitly as
\[
\hat m(x) = s_x^T \boldsymbol{Y} = e_1^T \left( \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} \boldsymbol{X}_x \right)^{-1} \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} \boldsymbol{Y}, \tag{3}
\]
with $e_i$ a vector with a one in the $i$th position and zeros elsewhere, the matrix $K_x = \mathrm{diag}\left\{ \tfrac{1}{h} K\!\left(\tfrac{X_1 - x}{h}\right), \ldots, \tfrac{1}{h} K\!\left(\tfrac{X_n - x}{h}\right) \right\}$, $R = \mathrm{diag}\{ r(X_1), \ldots, r(X_n) \}$, and
\[
\boldsymbol{X}_x = \begin{pmatrix} 1 & (X_1 - x) & \cdots & (X_1 - x)^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & (X_n - x) & \cdots & (X_n - x)^p \end{pmatrix}
\]
(see also Ruppert and Wand, 1994). Since $R^{1/2}$ and $K_x$ are diagonal matrices, $R^{1/2} K_x R^{1/2} = R K_x = K_x R$. We prefer the first notation because it generalizes readily to non-diagonal weight matrices and simplifies the matrix algebra in the proofs.
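To make the estimator concrete, here is a minimal computational sketch of the weighted local linear fit ($p = 1$) in equation (3). It is illustrative only, not the gam() implementation; the Epanechnikov kernel and the function names are our own choices.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel (any bounded, compactly supported kernel would do)."""
    return 0.75 * np.maximum(1.0 - u**2, 0.0)

def weighted_loclin(x0, X, Y, h, r):
    """Weighted local linear estimate m_hat(x0), as in equation (3) with p = 1.

    X, Y : 1-d arrays of covariate and response values.
    h    : bandwidth.
    r    : 1-d array of observation weights r(X_i).
    """
    Xx = np.column_stack([np.ones_like(X, dtype=float), X - x0])  # design matrix X_x
    k = epanechnikov((X - x0) / h) / h                            # kernel weights, diag(K_x)
    w = k * r                                                     # combined weights R^{1/2} K_x R^{1/2}
    XtW = Xx.T * w                                                # X_x^T R^{1/2} K_x R^{1/2}
    beta = np.linalg.solve(XtW @ Xx, XtW @ Y)                     # weighted least squares at x0
    return beta[0]                                                # beta_0 = m_hat(x0)

# Example use: evaluate the fit on a grid, with r = 1/v(X_i) if v were known.
# x_grid = np.linspace(X.min(), X.max(), 100)
# m_hat = np.array([weighted_loclin(x0, X, Y, h=0.2, r=r_weights) for x0 in x_grid])
```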

We introduce some additional notation and assumptions before stating the asymptotic bias and variance results for $\hat m(x)$. Let $f(x)$ represent the density of $X_i$. For the kernel function $K$, write the moments of $K$ as $\mu_j(K) = \int u^j K(u)\,du$ for any $j$ and let $R(K) = \int K(u)^2\,du$. We make the following technical assumptions:

(AS.I) The kernel $K$ is bounded and continuous and has compact support. Also, $\mu_{p+1}(K) \neq 0$.

(AS.II) The density $f$ is bounded, continuous and differentiable, has compact support and $f(x) > 0$ for all $x \in [a_x, b_x] = \mathrm{supp}(f)$.

(AS.III) The weight function $r$ is bounded, continuous and differentiable, and $r(x) > 0$ for all $x \in [a_x, b_x]$.

(AS.IV) The mean function $m$ is continuous and differentiable up to order $p + 1$ over $(a_x, b_x)$.

(AS.V) The variance function $v$ is continuous and $v(x) > 0$ for all $x \in [a_x, b_x]$.

(AS.VI) As $n \to \infty$, $h \to 0$ and $nh \to \infty$.

The following result is proven in the Appendix.

Theorem 3.1 For local polynomial fitting of degree $p$, for $p > 0$ and odd, the conditional bias and variance of $\hat m(x)$ for $x \in (a_x, b_x)$ can be approximated by
\[
E(\hat m(x) - m(x) \mid \boldsymbol{X}) = \frac{1}{(p+1)!}\, h^{p+1} \mu_{p+1}(K)\, m^{(p+1)}(x) + o_p(h^{p+1})
\]
and
\[
\mathrm{Var}(\hat m(x) \mid \boldsymbol{X}) = \frac{1}{nh}\, R(K)\, v(x)\, f(x)^{-1} \left(1 + o_p(1)\right).
\]


As discussed in Remark 1 of Ruppert and Wand (1994), the leading terms of the bias and variance do not depend on $\boldsymbol{X}$. Nevertheless, this is still a conditional

    result, because in general the unconditional asymptotic bias and variance are not

    guaranteed to exist.

    Theorem 3.1 implies that, asymptotically, the inclusion of observation weights

    that are a smooth function of the covariate has no effect on the bias and variance

    of the local polynomial regression estimator. As we will show below, however, that

    is no longer true when local polynomial regression is used within an additive model

    context.

    3.2 Additive Models Using Weighted Smoothers

We now consider data generated by the additive model
\[
Y_i = \alpha + m_1(X_{1i}) + \ldots + m_D(X_{Di}) + v(X_{1i}, \ldots, X_{Di})^{1/2} \varepsilon_i,
\]
where $v(\cdot)$ is a continuous, bounded function. We are interested in the heteroskedastic case since this is typically the situation encountered in the local scoring algorithm. Backfitting estimators for the additive model are usually defined as the solution of the

    backfitting algorithm at convergence. The backfitting algorithm is shown in step

    2(b) in Figure 1.

An equivalent definition for these estimators is to view them as the solutions to the following set of estimating equations:
\[
\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & \vdots & \ddots & \vdots \\ S_D & S_D & \cdots & I \end{pmatrix}
\begin{pmatrix} \hat{\boldsymbol{m}}_1 \\ \hat{\boldsymbol{m}}_2 \\ \vdots \\ \hat{\boldsymbol{m}}_D \end{pmatrix}
=
\begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_D \end{pmatrix} \boldsymbol{Y}, \tag{4}
\]
where $S_1, \ldots, S_D$ are $n \times n$ smoother matrices for the $D$ covariates. In the case of local polynomial regression, if $s_{d,x}$ represents the $n \times 1$ smoother vector that maps the vector of observations $\boldsymbol{Y}$ to its nonparametric mean function estimate at a point $x$ for the $d$th covariate (as equation (3) did in the univariate case), then
\[
S_d = \begin{pmatrix} s_{d,X_{d1}}^T \\ \vdots \\ s_{d,X_{dn}}^T \end{pmatrix}.
\]


    Equation (3) provides the explicit expression for the smoother vectors for weighted

    local polynomial regression.

    Expression (4) represents a system of nD equations in nD unknowns and is

    solved through backfitting, but it is also possible, at least conceptually, to write the

    estimators directly as

\[
\begin{pmatrix} \hat{\boldsymbol{m}}_1 \\ \hat{\boldsymbol{m}}_2 \\ \vdots \\ \hat{\boldsymbol{m}}_D \end{pmatrix}
=
\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & \vdots & \ddots & \vdots \\ S_D & S_D & \cdots & I \end{pmatrix}^{-1}
\begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_D \end{pmatrix} \boldsymbol{Y}
\equiv M^{-1} C \boldsymbol{Y},
\]

    with the matrices M and C defined in this equation.

For most smoothing methods in practice, including local polynomial regression, the matrix $M$ as written here is not invertible. The smoothing matrices $S_d$ need to be replaced by the centered smoothing matrices $S_d^* = (I - \boldsymbol{1}\boldsymbol{1}^T/n) S_d$ for all $d = 1, \ldots, D$, where $\boldsymbol{1}$ is an $n \times 1$ vector of ones. The invertibility of the matrix $M$ composed of centered smoothers is discussed in Buja et al. (1989) and, for the case of unweighted local polynomial fitting, in Opsomer and Ruppert (1997) and Opsomer (2000).

For simplicity, we will focus here on the case in which $D = 2$ and the local polynomials are of odd degree. The main results generalize to $D > 2$ using the recursive approach of Opsomer (2000), but the expressions become much more complicated. Buja et al. (1989) give explicit expressions for the $\hat{\boldsymbol{m}}_d$ when $D = 2$:
\[
\begin{aligned}
\hat{\boldsymbol{m}}_1 &= \left\{ I - (I - S_1^* S_2^*)^{-1} (I - S_1^*) \right\} \boldsymbol{Y} \equiv W_1 \boldsymbol{Y} \\
\hat{\boldsymbol{m}}_2 &= \left\{ I - (I - S_2^* S_1^*)^{-1} (I - S_2^*) \right\} \boldsymbol{Y} \equiv W_2 \boldsymbol{Y},
\end{aligned} \tag{5}
\]

provided the inverses exist. Since these direct expressions correspond to the solutions of the backfitting algorithm at convergence, it is possible to derive many of

    the properties of backfitting estimators from them. In particular, the convergence of

the algorithm in step 2(b) of Figure 1 and the uniqueness of the estimators (5) both follow

    directly from the existence of the inverse of M.
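As a computational illustration of (4), (5) and the centering step, the following sketch builds weighted local linear smoother matrices and runs the backfitting iteration of step 2(b) for $D = 2$. The names and kernel choice are ours; this is a sketch of the technique, not the gam() implementation.

```python
import numpy as np

def loclin_smoother_matrix(X, h, r):
    """n x n weighted local linear smoother matrix S: row i is the smoother
    vector s_{X_i}^T from equation (3), so (S @ Y)[i] = m_hat(X_i)."""
    n = len(X)
    S = np.zeros((n, n))
    for i in range(n):
        Xx = np.column_stack([np.ones(n), X - X[i]])
        k = 0.75 * np.maximum(1.0 - ((X - X[i]) / h) ** 2, 0.0) / h  # Epanechnikov kernel
        XtW = Xx.T * (k * r)
        S[i] = np.linalg.solve(XtW @ Xx, XtW)[0]   # e_1^T (X'WX)^{-1} X'W
    return S

def backfit_two(Y, S1, S2, tol=1e-8, max_iter=500):
    """Backfitting for D = 2 using centered smoothers S* = (I - 11'/n) S."""
    n = len(Y)
    center = np.eye(n) - np.ones((n, n)) / n
    S1c, S2c = center @ S1, center @ S2
    alpha = Y.mean()
    m1 = np.zeros(n)
    m2 = np.zeros(n)
    for _ in range(max_iter):
        m1_new = S1c @ (Y - alpha - m2)            # update component 1
        m2_new = S2c @ (Y - alpha - m1_new)        # update component 2 with latest m1
        change = max(np.max(np.abs(m1_new - m1)), np.max(np.abs(m2_new - m2)))
        m1, m2 = m1_new, m2_new
        if change < tol:
            break
    return alpha, m1, m2
```

In this sketch the centering is applied directly to the smoother matrices, so the component functions are identifiable and the iteration corresponds to solving (4) by Gauss-Seidel sweeps.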

The results from Opsomer and Ruppert (1997) will be generalized to the situation in which a set of observation weights $r(\boldsymbol{X}_i)$, $i = 1, \ldots, n$ are used in the local polynomial regression. Let $r_1(x_1) = E(r(\boldsymbol{X}_i) \mid X_{1i} = x_1)$ and $r_2(x_2) = E(r(\boldsymbol{X}_i) \mid X_{2i} = x_2)$ denote the conditional univariate weight functions. The following assumptions are made:


(AS.I) The kernel $K$ is bounded and continuous and has compact support. Also, $\mu_{p_1+1}(K), \mu_{p_2+1}(K) \neq 0$.

(AS.II) The design densities $f$, $f_1$ and $f_2$ are bounded, continuous and differentiable, have compact support and $f(x_1, x_2) > 0$ for all $(x_1, x_2) \in \mathrm{supp}(f)$. The first derivatives of $f_1$ and $f_2$ have a finite number of sign changes over their support.

(AS.III) The weight function $r$ is bounded, continuous and differentiable, and $r(x_1, x_2) > 0$ for all $(x_1, x_2) \in \mathrm{supp}(f)$. For any fixed $x_1$, $\partial r(x_1, x_2)/\partial x_2$ has a finite number of sign changes over $\mathrm{supp}(f)$, and similarly with both variables interchanged. Also, the first derivatives of $r_1$ and $r_2$ have a finite number of sign changes over $\mathrm{supp}(f)$.

(AS.IV) The additive functions $m_1$, $m_2$ are continuous and differentiable up to order $p_1 + 1$, $p_2 + 1$, respectively, over $\mathrm{supp}(f)$.

(AS.V) The variance function $v$ is continuous and $v(x_1, x_2) > 0$ for all $(x_1, x_2) \in \mathrm{supp}(f)$.

(AS.VI) As $n \to \infty$, $h_1, h_2 \to 0$ and $nh_1/\log(n), nh_2/\log(n) \to \infty$.

Assumption (AS.III) on the conditional weight functions is rather technical in nature, but should be satisfied by any reasonable weight function used in the local scoring context. The remaining assumptions are the same as in Opsomer and Ruppert (1997). The following result generalizes Lemmas 3.1–3.2 of Opsomer and

    Ruppert (1997) and is proven in the Appendix.

Lemma 3.1 Under Assumptions (AS.I)-(AS.III), the following asymptotic approximations hold uniformly over all elements of the matrices:
\[
S_1^* = S_1 - \boldsymbol{1}\boldsymbol{1}^T/n + o(\boldsymbol{1}\boldsymbol{1}^T/n) \quad \text{a.s.}
\]
\[
S_1^* S_2^* = T_{12} + o(\boldsymbol{1}\boldsymbol{1}^T/n) \quad \text{a.s.}
\]
where $T_{12}$ is a matrix whose $ij$th element is
\[
[T_{12}]_{ij} = \frac{1}{n}\, \frac{f(X_{1i}, X_{2j})}{f_1(X_{1i})\, f_2(X_{2j})}\, \frac{r(X_{1i}, X_{2j})}{r_1(X_{1i})\, r_2(X_{2j})}\, r(X_{1j}, X_{2j}) - \frac{1}{n}.
\]


    The approximation for T12 simplifies to that given in Opsomer and Ruppert

    (1997), Lemma 3.1 under equal weighting. In Lemma 3.2, Opsomer and Ruppert

(1997) provide sufficient conditions on the joint distribution of $X_{1i}$ and $X_{2i}$ to ensure that the additive model estimator is asymptotically unique. Because asymptotic uniqueness of the additive model estimators depends on the invertibility of $(I - T_{12})$, it follows directly from Lemma 3.1 that in the weighted case, both the distribution of

    the Xi and the weight function will have an effect on the existence of the estimators

    through the spectral radius of T12 (see Remark 3.1 in Opsomer and Ruppert (1997)

    for details). Developing sufficient conditions guaranteeing asymptotic uniqueness in

    the weighted case would therefore be very cumbersome and not very useful, since in

    practice they cannot be checked. We will therefore make an additional assumption

guaranteeing invertibility:

(AS.VII) There exists a matrix norm $\|\cdot\|$ such that $\|T_{12}\| < 1$.

This assumption and the uniform convergence results in Lemma 3.1 are sufficient to

    prove that Lemma 3.2 in Opsomer and Ruppert (1997) holds for weighted additive

    models. In particular, this guarantees that the estimators exist for sufficiently large

    n and that backfitting converges to a unique solution.
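In practice, a sample analogue of this uniqueness condition can be checked directly from the centered smoother matrices. A minimal sketch (our own; S1c and S2c are assumed to be the centered matrices built as in the backfitting sketch above):

```python
import numpy as np

def backfitting_unique(S1c, S2c):
    """Check the sample analogue of (AS.VII): for D = 2, the backfitting iteration
    has a unique solution when the spectral radius of S1c @ S2c (which T12
    approximates) is strictly below 1."""
    return np.max(np.abs(np.linalg.eigvals(S1c @ S2c))) < 1.0
```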

Additional notation is needed before stating the next result. Let $D^p$ represent the $p$th derivative operator, and let
\[
D^p \boldsymbol{m} = \left( \frac{d^p m(X_1)}{dx^p}, \ldots, \frac{d^p m(X_n)}{dx^p} \right)^T
\]
for any function $m(\cdot)$. The main result of this section is stated in the following theorem, proven in the Appendix.

Theorem 3.2 Suppose that assumptions (AS.I)-(AS.VII) hold and that the local polynomials are of odd degree $p_1, p_2$. At the observation points $(X_{1i}, X_{2i})$, $i = 1, \ldots, n$, the conditional bias and variance of $\hat m_1(X_{1i})$ can be approximated by
\[
\begin{aligned}
E(\hat m_1(X_{1i}) - m_1(X_{1i}) \mid \boldsymbol{X}_1, \boldsymbol{X}_2)
 = {}& \frac{1}{(p_1+1)!}\, h_1^{p_1+1} \mu_{p_1+1}(K)\, e_i^T (I - T_{12})^{-1} \left\{ D^{p_1+1}\boldsymbol{m}_1 - E\!\left(m_1^{(p_1+1)}(X_1)\right) \boldsymbol{1} \right\} \\
 & - \frac{1}{(p_2+1)!}\, h_2^{p_2+1} \mu_{p_2+1}(K)\, e_i^T (I - T_{12})^{-1} \left\{ M_2 - E\!\left(m_2^{(p_2+1)}(X_2)\right) \boldsymbol{1} \right\} \\
 & + O_p\!\left(\frac{1}{\sqrt{n}}\right) + o_p\!\left(h_1^{p_1+1} + h_2^{p_2+1}\right),
\end{aligned}
\]
with
\[
M_2 = \begin{pmatrix}
E\!\left(m_2^{(p_2+1)}(X_2)\, r(X_1, X_2) \mid X_1 = X_{11}\right) / r_1(X_{11}) \\
\vdots \\
E\!\left(m_2^{(p_2+1)}(X_2)\, r(X_1, X_2) \mid X_1 = X_{1n}\right) / r_1(X_{1n})
\end{pmatrix},
\]
and
\[
\mathrm{Var}(\hat m_1(X_{1i}) \mid \boldsymbol{X}_1, \boldsymbol{X}_2) = \frac{1}{nh_1}\, \frac{R(K)}{f_1(X_{1i})}\, \frac{E\!\left(v(X_1, X_2)\, r(X_1, X_2)^2 \mid X_1 = X_{1i}\right)}{r_1(X_{1i})^2} + o_p\!\left(\frac{1}{nh_1}\right).
\]

    As in Opsomer and Ruppert (1997), it is also possible to derive the properties of

the estimator for the additive mean function $E(Y_i \mid X_{1i}, X_{2i}) = \alpha + m_1(X_{1i}) + m_2(X_{2i})$. The bias of that estimator is the sum of the bias terms for $\hat m_1(X_{1i})$ and $\hat m_2(X_{2i})$, after removing the $O_p(1/\sqrt{n})$ terms, and its variance is the sum of their variances

    (see Opsomer and Ruppert (1997), Theorem 4.1 for details).

    The theorem shows that the weighted additive model estimator is consistent and

has a bias of order $O_p(h_1^{p_1+1} + h_2^{p_2+1})$, so that it has the same rates of convergence

    as one-dimensional smoothing. However, the bias expression in Theorem 3.2 is quite

    complicated in the weighted case, and it is not clear whether the bias is smaller or

    larger than in the unweighted case.

    The use of the weight function r also results in a variance of the same order as

    the unweighted case, with a more complicated leading term. If the errors have a

constant variance $v \equiv \sigma^2$, then it is easy to see that the use of non-equal weights will increase the asymptotic variance, since
\[
\frac{E\!\left(r(X_1, X_2)^2 \mid X_1 = X_{1i}\right)}{r_1(X_{1i})^2} \geq 1
\]
for any weight function $r$, by Jensen's inequality applied to $r_1(X_{1i}) = E(r(X_1, X_2) \mid X_1 = X_{1i})$. This is in marked contrast to the result in Theorem 3.1

    for univariate weighted local polynomial regression, where the effect of weighting is

    asymptotically negligible. If the variance function is known, however, then weighting

    might still be a good idea, as the following corollary makes more precise.

Corollary 3.1 Suppose that $r(X_1, X_2) = v(X_1, X_2)^{-1}$ for all $(X_1, X_2) \in \mathrm{supp}(f)$. Then, the variance of the additive model estimator is approximated by
\[
\mathrm{Var}(\hat m_1(X_{1i}) \mid \boldsymbol{X}_1, \boldsymbol{X}_2) = \frac{1}{nh_1}\, \frac{R(K)}{f_1(X_{1i})\, E\!\left(v(X_1, X_2)^{-1} \mid X_1 = X_{1i}\right)} + o_p\!\left(\frac{1}{nh_1}\right).
\]


This corollary shows that, if it is possible to weight the observations by the inverse of the true variance function, then the effect of weighting on the asymptotic variance is

    negligible. Note that the asymptotic bias in Theorem 3.2 remains affected by the

    weights even in that case.

    4 Local Scoring Estimators

    We apply the results from Section 3 to the local scoring estimators for generalized

    additive models. At any iteration t of the local scoring algorithm in Figure 1, the

    weighted additive model provides consistent estimators for the additive model

\[
E(z_i) = \alpha + m_1(X_{1i}) + m_2(X_{2i})
\]
with weights
\[
r(X_{1i}, X_{2i}) = g'(\mu_i)^{-2}\, V(\mu_i)^{-1},
\]
where $\mu_i = g^{-1}(\alpha + m_1(X_{1i}) + m_2(X_{2i}))$ as calculated in the previous iteration $t - 1$.

In general, it will not be possible to check that all the conditions on $r$ stated in (AS.III) hold for the estimated weights $\hat r$ used at each iteration. If $g$ and $V$ are known functions of $\mu = E(Y \mid X_1, X_2)$, it is possible to check whether the conditions hold for $r(X_1, X_2) = g'(\mu)^{-2} V(\mu)^{-1}$. For the right choice of bandwidth and reasonably well-behaved data, however, it is reasonable to assume that $\hat r(X_1, X_2)$ will then also be a continuous, differentiable function over $\mathrm{supp}(f)$ with the technical smoothness properties mentioned in (AS.III).
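As an illustration of these weights (our own sketch, for the binomial/logit special case rather than a general link): with $g(\mu) = \log\{\mu/(1-\mu)\}$ we have $g'(\mu) = 1/\{\mu(1-\mu)\}$ and $V(\mu) = \mu(1-\mu)$, so $r = g'(\mu)^{-2} V(\mu)^{-1} = \mu(1-\mu)$, the familiar binomial working weights.

```python
import numpy as np

def logit_working_weights(eta):
    """r = g'(mu)^(-2) V(mu)^(-1) for the logit link, which reduces to mu * (1 - mu)."""
    mu = 1.0 / (1.0 + np.exp(-np.asarray(eta)))   # mu = g^{-1}(eta)
    return mu * (1.0 - mu)
```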

    The local scoring algorithm of Hastie and Tibshirani (1990) is difficult to analyze

for general linear smoothers. For generalized additive models fitted using local polynomial regression, local scoring does not explicitly solve a set of score equations as in

the generalized linear model case (e.g. McCullagh and Nelder (1989), p. 41). Therefore, there is no explicit expression for what the Newton-Raphson steps converge to.

This was one of the main reasons for proposing local likelihood estimation for

    GAM in Kauermann and Opsomer (2002) as an easier-to-study alternative to local

scoring. Nevertheless, we can look at two simple cases to gain some insight into the

    behavior of the local scoring algorithm.

First, it is easy to show that if the model has an identity link ($\mu_i = \eta_i$), the

    local scoring algorithm in Figure 1, using the starting values proposed by Hastie and

    Tibshirani (1990), is equivalent to a weighted additive model. Hence, in this case

    only one iteration of the outer loop in Figure 1 is performed and the asymptotic

    properties of the estimators are given in Theorem 3.2.


    Second, consider the hypothetical one-step estimate where the true values of

    the unknown quantities are used as starting values. This corresponds to the approach

used when an iterative algorithm solves explicit equations, since in that case the one-step estimate using the true values is asymptotically equivalent to the fully iterated

    solution (see Serfling, 1980, p.258). This will be the case if we can assume that

    the local scoring estimator is consistent, for instance. Aerts et al. (2002) use this

    approach for generalized additive model estimation with penalized regression splines.

At the true values, we have $\mu_i = g^{-1}(\alpha + m_1(X_{1i}) + m_2(X_{2i}))$ and
\[
z_i = \alpha + m_1(X_{1i}) + m_2(X_{2i}) + (Y_i - \mu_i)\, g'(\mu_i).
\]
Hence, $E(z_i) = \alpha + m_1(X_{1i}) + m_2(X_{2i})$ and $\mathrm{Var}(z_i) = g'(\mu_i)^2\, V(\mu_i) = r(X_{1i}, X_{2i})^{-1}$, so that Theorem 3.2 and Corollary 3.1 again apply directly to this case. If the weight function is not correctly specified, only Theorem 3.2 applies.

This implies that if we have starting values for $\mu_i$ that are close to the true values, the Hastie and Tibshirani (1990) local scoring algorithm should provide estimators with desirable statistical properties similar to those of additive models,

    timators with desirable statistical properties similar to those of additive models,

    including asymptotic unbiasedness and one-dimensional nonparametric regression

    convergence rates. These results generalize to the D-dimensional case, as done for

    additive models in Opsomer (2000).

    5 Conclusions

In this article, we have described the asymptotic properties of additive models and

    generalized additive models in the presence of observation weights. We have shown

that, unlike in univariate nonparametric regression, observation weights can potentially inflate the variance and modify the bias in the additive model. The effect on

    the asymptotic variance can be avoided, but only if the weights correspond to the

    inverse of the variance of the model errors. Hence, if the weights are not variance-

    related, for instance when the weighting comes from sampling design considerations,

the resulting estimator will have a larger variance than the unweighted estimator

    (this is analogous to what happens with weighted least squares (WLS) estimators

    in parametric linear regression). Overall, while the weights indeed affect the leading

    terms of both the asymptotic bias and variance, they do not change the convergence

    rates of the estimators.


    We have discussed some of the implications of these findings for the widely used

    local scoring estimators. In particular, if the model (and its variance) is well-specified

and the local scoring estimators are consistent, the effect of the weights is asymptotically negligible. Because of the iterative nature of the estimator, it is difficult to

    prove this rigorously, however.

One important implication of this article is that the effect of the weights cannot be

    ignored in additive and generalized additive models. In addition to exploring some of

    the consequences of the weights in this article, the results proven here will be helpful

    for researchers working on these models. For instance, Kauermann and Opsomer

    (2002) have used the results on weighted additive models in deriving the asymptotic

    properties of local likelihood estimators for generalized additive models.

    A Proofs

    Proof of Theorem 3.1: To simplify notation, we will prove the theorem for the case

    p = 1. The method of proof is entirely analogous to that of Ruppert and Wand

    (1994), Theorem 2.1, and it can be generalized to arbitrary p following the approach

    in their Theorem 4.1.

First, note that the bias can be written as
\[
E(\hat m(x) - m(x) \mid \boldsymbol{X}) = \frac{1}{2}\, e_1^T \left( \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} \boldsymbol{X}_x \right)^{-1} \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} \left( Q_m(x) + B_m(x) \right),
\]
with $Q_m(x) = ((X_1 - x)^2, \ldots, (X_n - x)^2)^T m''(x)$ and $B_m(x)$ a vector of Taylor series remainder terms. The latter is of smaller order than the terms in $Q_m(x)$, provided this is non-zero. Now,
\[
\left( \frac{1}{n} \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} \boldsymbol{X}_x \right)^{-1} =
\begin{pmatrix}
(r(x) f(x))^{-1} + o_p(1) & -D(rf)(x)\,(r(x) f(x))^{-2} + o_p(1) \\
-D(rf)(x)\,(r(x) f(x))^{-2} + o_p(1) & \left( \mu_2(K)\, r(x) f(x)\, h^2 \right)^{-1} + o_p(h^{-2})
\end{pmatrix}
\]
and
\[
\frac{1}{n} \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} Q_m(x) =
\begin{pmatrix}
h^2 \mu_2(K)\, r(x) f(x) + o_p(h^2) \\
h^4 \mu_4(K)\, D(rf)(x) + o_p(h^4)
\end{pmatrix} m''(x),
\]
so that
\[
E(\hat m(x) - m(x) \mid \boldsymbol{X}) = \frac{1}{2}\, h^2 \mu_2(K)\, m''(x) + o_p(h^2).
\]

For the variance, we need to approximate
\[
\frac{1}{n} \boldsymbol{X}_x^T R^{1/2} K_x R^{1/2} V R^{1/2} K_x R^{1/2} \boldsymbol{X}_x =
\frac{1}{h}\, r(x)^2 f(x) v(x)
\begin{pmatrix}
R(K) & h \mu_1(K^2) \\
h \mu_1(K^2) & h^2 \mu_2(K^2)
\end{pmatrix}
\left( 1 + o_p(1) \right),
\]
so that
\[
\mathrm{Var}(\hat m(x) \mid \boldsymbol{X}) = \frac{1}{nh}\, R(K)\, v(x)\, f(x)^{-1} \left( 1 + o_p(1) \right).
\]

    Proof of Lemma 3.1: The proof will follow the same approach as that in Opsomer

    and Ruppert (1997) Lemma 3.1. We begin by showing that the statements in the

    lemma hold in probability, using the approximation

\[
[S_1]_{ij} = \frac{1}{nh_1}\, r_1(X_{1i})^{-1} f_1(X_{1i})^{-1}\, K\!\left(\frac{X_{1j} - X_{1i}}{h_1}\right) r(X_{1j}, X_{2j}) \left(1 + o_p(1)\right).
\]

    For the first statement, the reasoning is completely analogous to that in Opsomer

    and Ruppert (1997) Lemma 3.1. For the second statement,

\[
\begin{aligned}
[S_1 S_2]_{ij} &= \frac{1}{n^2 h_1 h_2}\, \frac{r(X_{1j}, X_{2j})}{r_1(X_{1i})\, f_1(X_{1i})} \left(1 + o_p(1)\right)
\sum_{k=1}^{n} \frac{r(X_{1k}, X_{2k})}{r_2(X_{2k})\, f_2(X_{2k})}\, K\!\left(\frac{X_{1k} - X_{1i}}{h_1}\right) K\!\left(\frac{X_{2k} - X_{2j}}{h_2}\right) \\
&= \frac{1}{n}\, \frac{f(X_{1i}, X_{2j})}{f_1(X_{1i})\, f_2(X_{2j})}\, \frac{r(X_{1i}, X_{2j})}{r_1(X_{1i})\, r_2(X_{2j})}\, r(X_{1j}, X_{2j}) \left(1 + o_p(1)\right).
\end{aligned}
\]
The second statement in the lemma then holds in probability, from this approximation and the first statement.

    Because of the additional assumptions on the weight function in (AS.III), the

    same approach as in the proof of Opsomer and Ruppert (1997) Lemma 3.1 can be

    followed to prove the uniform convergence.

Proof of Theorem 3.2: The proof will follow the approach used in proving Theorem 4.1 in Opsomer and Ruppert (1997), and for simplicity, we consider only the case $p_1 = p_2 = 1$. We let $Q_1 = (s_{1,X_{11}}^T Q_{m_1}(X_{11}), \ldots, s_{1,X_{1n}}^T Q_{m_1}(X_{1n}))^T$ and $Q_1^* = (I - \boldsymbol{1}\boldsymbol{1}^T/n) Q_1$, with analogous definitions holding for $Q_2$ and $Q_2^*$. It follows directly from Theorem 3.1 and equation (5) that
\[
E(\hat{\boldsymbol{m}}_1 - \boldsymbol{m}_1) = \frac{1}{2} (I - S_1^* S_2^*)^{-1} \left( Q_1^* - S_1^* Q_2^* \right) + O_p\!\left(\frac{1}{\sqrt{n}}\right) + o_p\!\left(h_1^2 + h_2^2\right)
\]
with the $O_p(1/\sqrt{n})$ term due to the presence of the model intercept. $Q_1^*$ can be approximated as in the proof of Theorem 3.1 and shown to be asymptotically unaffected by the presence of the weights. Unlike in the unweighted case, however, a similar calculation for $S_1^* Q_2^*$ will involve terms of the form
\[
s_{1,x_1}^T D^2 \boldsymbol{m}_2 = \frac{E\!\left( m_2''(X_2)\, r(X_1, X_2) \mid X_1 = x_1 \right)}{r_1(x_1)} + o_p(1),
\]
so that
\[
S_1^* Q_2^* = \mu_2(K)\, h_2^2\, M_2^* + o_p(h_2^2),
\]
where $M_2^* = (I - \boldsymbol{1}\boldsymbol{1}^T/n) M_2$. Hence, the conditional bias of $\hat{\boldsymbol{m}}_1$ can be approximated by
\[
\begin{aligned}
E(\hat{\boldsymbol{m}}_1 - \boldsymbol{m}_1 \mid \boldsymbol{X}_1, \boldsymbol{X}_2)
 = {}& \frac{1}{2}\, h_1^2 \mu_2(K)\, (I - T_{12})^{-1} \left\{ D^2 \boldsymbol{m}_1 - E\!\left(m_1''(X_1)\right) \boldsymbol{1} \right\} \\
 & - \frac{1}{2}\, h_2^2 \mu_2(K)\, (I - T_{12})^{-1} \left\{ M_2 - E\!\left(m_2''(X_2)\right) \boldsymbol{1} \right\} \\
 & + O_p\!\left(\frac{1}{\sqrt{n}}\right) + o_p\!\left(h_1^2 + h_2^2\right).
\end{aligned}
\]

For the variance approximation, we start from the exact variance
\[
\mathrm{Var}(\hat m_1(X_{1i}) \mid \boldsymbol{X}_1, \boldsymbol{X}_2) = e_i^T W_1 V W_1^T e_i
\]
with $V = \mathrm{diag}\{ v(X_{11}, X_{21}), \ldots, v(X_{1n}, X_{2n}) \}$. Following the approach in the proof of Theorem 4.1 in Opsomer and Ruppert (1997), the leading variance term can be shown to be $e_i^T S_1^* V S_1^{*T} e_i$. Using the approach from the proof of Theorem 3.1 again to approximate this term, we find
\[
\mathrm{Var}(\hat m_1(X_{1i}) \mid \boldsymbol{X}_1, \boldsymbol{X}_2) = \frac{1}{nh_1}\, \frac{R(K)}{f_1(X_{1i})}\, \frac{E\!\left( v(X_1, X_2)\, r(X_1, X_2)^2 \mid X_1 = X_{1i} \right)}{r_1(X_{1i})^2} + o_p\!\left(\frac{1}{nh_1}\right).
\]

    References

Aerts, M., G. Claeskens, and M. Wand (2002). Some theory for penalized spline generalized additive models. Journal of Statistical Planning and Inference 103, 455–470.

Ansley, C. F. and R. Kohn (1994). Convergence of the backfitting algorithm for additive models. Journal of the Australian Mathematical Society (Series A) 57, 316–329.

Bio, A., R. Alkemade, and A. Barendregt (1998). Determining alternative models for vegetation response analysis: a non-parametric approach. Journal of Vegetation Science 9, 5–16.

Buja, A., T. J. Hastie, and R. J. Tibshirani (1989). Linear smoothers and additive models. Annals of Statistics 17, 453–555.

Burman, P. (1990). Estimation of generalized additive models. Journal of Multivariate Analysis 32, 230–255.

Couper, D. and M. S. Pepe (1997). Modelling prevalence of a condition: Chronic graft-versus-host disease after bone marrow transplantation. Statistics in Medicine 16, 1551–1571.

Figueiras, A. and C. Cadarso-Suárez (2001). Application of nonparametric models for calculating odds ratios and their confidence intervals for continuous exposures. American Journal of Epidemiology 154(3), 264–275.

Fricker, R. D., Jr. and N. W. Hengartner (2001). Environmental equity and the distribution of toxic release inventory and other environmentally undesirable sites in metropolitan New York City. Environmental and Ecological Statistics 8(1), 33–52.

Friedman, J. H. and W. Stuetzle (1981). Projection pursuit regression. Journal of the American Statistical Association 76, 817–823.

Gu, C., D. M. Bates, Z. Chen, and G. Wahba (1989). The computation of GCV functions through Householder tridiagonalization with application to the fitting of interaction spline models. SIAM Journal on Matrix Analysis and Applications 10, 457–480.

Hardle, W. and P. Hall (1993). On the backfitting algorithm for additive regression models. Statistica Neerlandica 47, 43–57.

Hastie, T. J. and R. J. Tibshirani (1990). Generalized Additive Models. Washington, D.C.: Chapman and Hall.

Kauermann, G. and J. D. Opsomer (2002). Local likelihood estimation in generalized additive models. To appear in Scandinavian Journal of Statistics.

Linton, O. B. (2000). Efficient estimation of generalized additive nonparametric regression models. Econometric Theory 16, 502–523.

Mammen, E., O. Linton, and J. Nielsen (1999). The existence and asymptotic properties of a backfitting projection algorithm under weak conditions. Annals of Statistics 27, 1443–1490.

McCullagh, P. and J. A. Nelder (1989). Generalized Linear Models (2nd ed.). London: Chapman and Hall.

Nelder, J. A. and R. W. M. Wedderburn (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A 135, 370–384.

Opsomer, J. D. (2000). Asymptotic properties of backfitting estimators. Journal of Multivariate Analysis 73, 166–179.

Opsomer, J. D. and D. Ruppert (1997). Fitting a bivariate additive model by local polynomial regression. Annals of Statistics 25, 186–211.

Rothery, P. and D. B. Roy (2001). Application of generalized additive models to butterfly transect count data. Journal of Applied Statistics 28(7), 897–909.

Ruppert, D. and M. P. Wand (1994). Multivariate locally weighted least squares regression. Annals of Statistics 22, 1346–1370.

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons.

Sperlich, S., O. Linton, and W. Hardle (1999). Integration and backfitting methods in additive models: finite sample properties and comparison. Test 8, 419–459.

Stone, C. J. (1985). Additive regression and other nonparametric models. Annals of Statistics 13, 689–705.

Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models. Annals of Statistics 14, 590–606.

Wahba, G. (1986). Partial and interaction splines for the semiparametric estimation of functions of several variables. In T. J. Boardman (Ed.), Computer Science and Statistics: Proceedings of the 18th Symposium on the Interface, pp. 75–80. American Statistical Association.


1. Initialize: $t = 0$, $\alpha = g(\bar y)$, $\boldsymbol{m}_1^t = \ldots = \boldsymbol{m}_D^t = \boldsymbol{0}$, with $\boldsymbol{m}_d^t = (m_d^t(X_{d1}), \ldots, m_d^t(X_{dn}))^T$.

2. Update:

   (a) Transformation/reweighting: construct an adjusted dependent variable
   \[
   z_i = \eta_i^t + (y_i - \mu_i^t) \left( \frac{\partial \eta_i}{\partial \mu_i} \right)_t, \quad i = 1, \ldots, n
   \]
   with $\eta_i^t = \alpha + \sum_{d=1}^{D} m_d^t(X_{di})$ and $\mu_i^t = g^{-1}(\eta_i^t)$, and the weights
   \[
   w_i = \left( \frac{\partial \mu_i}{\partial \eta_i} \right)_t^2 (V_i^t)^{-1}, \quad i = 1, \ldots, n.
   \]

   (b) Backfitting: fit the weighted additive model to $\boldsymbol{z} = (z_1, \ldots, z_n)^T$ to obtain estimated functions $m_d^{t+1}(\cdot)$ through backfitting:

      i. Initialize: $s = 0$, $\alpha = \bar z$, $\boldsymbol{m}_d^s = \boldsymbol{m}_d^t$, $d = 1, \ldots, D$.

      ii. Update:
      \[
      \boldsymbol{m}_1^{s+1} = S_1 \Big( \boldsymbol{z} - \sum_{d \neq 1} \boldsymbol{m}_d^s \Big)
      \]
      \[
      \vdots
      \]
      \[
      \boldsymbol{m}_D^{s+1} = S_D \Big( \boldsymbol{z} - \sum_{d \neq D} \boldsymbol{m}_d^s \Big)
      \]
      and set $s = s + 1$.

      iii. Repeat step ii until the estimated functions do not change, and set $\boldsymbol{m}_d^{t+1} = \boldsymbol{m}_d^{s+1}$, $d = 1, \ldots, D$, and $t = t + 1$.

3. Repeat step 2 until the estimated functions do not change.

Figure 1: Local scoring algorithm (Hastie and Tibshirani (1990), p. 141).
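A compact sketch of the outer loop in Figure 1 follows (illustrative only). Here `fit_weighted_additive` stands in for the weighted backfitting step 2(b), for example the $D = 2$ sketch in Section 3, and the link-related functions are supplied by the user; all names are our own.

```python
import numpy as np

def local_scoring(y, fit_weighted_additive, g, g_inv, g_prime, V,
                  tol=1e-6, max_iter=50):
    """Outer (scoring) loop of Figure 1.

    fit_weighted_additive(z, w) must return the fitted additive predictor
    alpha + sum_d m_d(X_di) from a weighted additive model fit to z.
    g, g_inv, g_prime, V: link, inverse link, derivative of the link, and
    variance function of the assumed exponential family.
    """
    n = len(y)
    eta = np.full(n, g(np.mean(y)))              # t = 0: alpha = g(ybar), m_d = 0
    for _ in range(max_iter):
        mu = g_inv(eta)
        z = eta + (y - mu) * g_prime(mu)         # adjusted dependent variable z_i
        w = 1.0 / (g_prime(mu) ** 2 * V(mu))     # weights w_i = g'(mu)^(-2) V(mu)^(-1)
        eta_new = fit_weighted_additive(z, w)    # step 2(b): weighted backfitting
        if np.max(np.abs(eta_new - eta)) < tol:  # step 3: stop when the fit settles
            return eta_new
        eta = eta_new
    return eta

# Example (binary response, logit link):
# expit = lambda t: 1.0 / (1.0 + np.exp(-t))
# eta_hat = local_scoring(y, my_weighted_backfit,
#                         g=lambda m: np.log(m / (1 - m)), g_inv=expit,
#                         g_prime=lambda m: 1.0 / (m * (1 - m)),
#                         V=lambda m: m * (1 - m))
```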
