Goodness–of–fit tests for regression models

Wenceslao González Manteiga, Rosa M. Crujeiras Casais

Departamento de Estadística e Investigación Operativa

Universidad de Santiago de Compostela


GoF for regression models

Introduction.
  Motivation.
  The distribution case.
  Parametric models.

Tests based on the estimation of the regression function.
  An example for fixed design.
  An example for random design.
  The generalized likelihood ratio test.
  Tests based on the empirical distribution of the residuals.
  Tests designed for avoiding the curse of dimensionality.
  Other approaches.
  Bootstrap approximations.
  Connections with the F test.
  Discussion about the power.

Tests based on the integrated regression function.
  The integrated regression function.
  The marked empirical process.
  Bootstrap approximations.

W. Gonzalez–Manteiga, R.M. Crujeiras GoF for regression models

Related setups, extensions and open problems.
  Testing the equality of regression curves.
  Testing partial linearity.
  Generalized linear regression models.
  Significance tests.
  Testing additivity.
  GoF for regression models with incomplete data.
  Tests for time series.
  Tests for spatial data.

Introduction

Motivation

X: explanatory variable.
Y: response random variable.
Regression function: $m(x) = E(Y \mid X = x)$.

Important question

Is the model $m \in \{m_\theta;\ \theta \in \Theta\}$ sufficiently well supported by the data?

See Seber (1977) for the linear regression case ($m_\theta(\cdot) = A^t(\cdot)\,\theta$) and Seber and Wild (1989) for nonlinear $m_\theta$.

Polynomial regression

X is a one-dimensional random variable.

$$A^t(x) = (1, x, x^2, \dots, x^{q-1}) \in \mathbb{R}^q, \qquad \theta = (\theta_1, \theta_2, \dots, \theta_q)^t \in \Theta \subset \mathbb{R}^q.$$

Multiple regression

X is a q-dimensional random variable.

$$A^t(x) = x \in \mathbb{R}^q, \qquad \theta \in \Theta \subset \mathbb{R}^q.$$

Some nonlinear regression model

$X = (X_1, X_2)^t$.

$$\theta = (\theta_1, \theta_2)^t \in \Theta = \{(s, t) \in \mathbb{R}^2 : s + t \neq 0\},$$

$$m_\theta(x) = \frac{1}{\theta_1 + \theta_2}\exp(\theta_1 x_1 + \theta_2 x_2).$$

A simulated example:

[Figure: simulated data, y against x on [0, 1], with curves labelled "true", "m0" and "mh".]

Model: $Y = 2x^2 - 5x + \cos(2\pi x) + \varepsilon$, with $n = 500$ and $\varepsilon \sim N(0, 1)$.

Null hypothesis: $H_0$: $m$ is linear.

A real dataset:

[Figure: food expenditure against income, with curves labelled "m0" and "mh".]

235 observations, early eighties.
X: income.
Y: expenditure on food for Belgian working class households.

Null hypothesis: $H_0$: $m$ is linear.

The distribution case

There is a strong analogy with goodness-of-fit tests for the distribution case.

X: random variable of interest, with distribution function $F$.

Main problem

Simple null hypothesis case: test $H_0$: $F = F_0$.

Distribution based tests

Based on the empirical process:

$$\alpha(x) = n^{1/2}\,(F_n(x) - F_0(x)) = n^{-1/2}\sum_{i=1}^{n}\left(1_{\{X_i \le x\}} - F_0(x)\right),$$

where $F_n$ is the empirical cdf.

Kolmogorov-Smirnov test:

$$T_n = \sup_{x \in \mathbb{R}} n^{1/2}\,|F_n(x) - F_0(x)|.$$

Cramér-von Mises test:

$$T_n = \int (F_n(x) - F_0(x))^2\, dF_0(x).$$
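As an illustration of the two statistics above, the following minimal Python sketch evaluates them for a simple null hypothesis with continuous $F_0$. The function names `ks_cvm_statistics` and `uniform_cdf` are hypothetical. The Cramér-von Mises value uses the standard closed form of $n\int (F_n - F_0)^2\, dF_0$ and divides it by $n$, because the statistic as written above carries no factor $n$.

```python
import math

def ks_cvm_statistics(sample, cdf0):
    """Return (T_KS, T_CvM) for H0: F = F0, with cdf0 a continuous cdf."""
    xs = sorted(sample)
    n = len(xs)
    u = [cdf0(x) for x in xs]  # F0 evaluated at the order statistics
    # sup_x |F_n(x) - F0(x)| is reached at a jump of F_n (at it or just before it).
    d = max(max((i + 1) / n - u[i], u[i] - i / n) for i in range(n))
    t_ks = math.sqrt(n) * d
    # Closed form of n * int (F_n - F0)^2 dF0 for continuous F0.
    w2 = 1.0 / (12 * n) + sum((u[i] - (2 * i + 1) / (2 * n)) ** 2 for i in range(n))
    t_cvm = w2 / n  # int (F_n - F0)^2 dF0, without the factor n
    return t_ks, t_cvm

def uniform_cdf(x):
    """cdf of the uniform distribution on [0, 1] (illustrative choice of F0)."""
    return min(1.0, max(0.0, x))

t_ks, t_cvm = ks_cvm_statistics([0.1, 0.4, 0.7], uniform_cdf)
print(t_ks, t_cvm)
```

For the three-point sample the supremum distance is 0.3 (reached at 0.7), so the Kolmogorov-Smirnov value is $\sqrt{3}\cdot 0.3$, and integrating $(F_n(x) - x)^2$ piecewise gives 0.02 for the Cramér-von Mises value.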

Density based tests

Chi-squared test:

$$T_n = \sum_{i=1}^{k}\frac{(O_i - E_i)^2}{E_i} = n\sum_{i=1}^{k}\frac{\left[\int_{I_i}(\hat f_H - f_0)\right]^2}{\int_{I_i} f_0} \simeq n\left(\int\frac{\hat f_H^{\,2}}{f_0} - 1\right),$$

where $\hat f_H$ is the histogram estimator of $f$.
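The first form of the chi-squared statistic can be computed directly from cell counts. The sketch below is only an illustration: the function name `chi_squared_statistic` and the way the cells $I_i$ are passed (as a list of edges) are assumptions, not part of the original slides.

```python
def chi_squared_statistic(sample, cdf0, edges):
    """Pearson statistic sum_i (O_i - E_i)^2 / E_i over cells [edges[i], edges[i+1])."""
    n = len(sample)
    k = len(edges) - 1
    observed = [0] * k
    for x in sample:
        for i in range(k):
            is_last = (i == k - 1)
            # The last cell is closed on the right so that x == edges[-1] is counted.
            if edges[i] <= x < edges[i + 1] or (is_last and x == edges[-1]):
                observed[i] += 1
                break
    # Expected counts under H0: n * (F0(right edge) - F0(left edge)).
    expected = [n * (cdf0(edges[i + 1]) - cdf0(edges[i])) for i in range(k)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Uniform F0 on [0, 1] with two cells: O = (3, 1), E = (2, 2).
print(chi_squared_statistic([0.1, 0.2, 0.3, 0.9], lambda x: x, [0.0, 0.5, 1.0]))
```

With observed counts (3, 1) and expected counts (2, 2), the statistic is $1/2 + 1/2 = 1$.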

Distance based tests

$$T_n = d(\hat f, f_0)$$

$\hat f$ is a nonparametric estimator of $f$ and $d$ is a functional distance, for instance

$$d(f, g) = \int |f - g| \quad\text{or}\quad d(f, g) = \left[\int (f - g)^2\right]^{1/2}.$$

Parametric models

Composite null hypothesis case

Test $H_0$: $F \in \mathcal{F} = \{F_\theta : \theta \in \Theta\}$, for some $\Theta \subset \mathbb{R}^q$.

All the approaches use some estimator $\hat\theta$ of $\theta$ (MLE, minimum distance, minimum chi-squared, ...).

Distribution based tests

Empirical process with estimated parameters:

$$\alpha_1(x) = n^{1/2}\left(F_n(x) - F_{\hat\theta}(x)\right),$$

leading to a Kolmogorov-Smirnov test:

$$T_n = \sup_{x \in \mathbb{R}} n^{1/2}\left|F_n(x) - F_{\hat\theta}(x)\right|,$$

or to a Cramér-von Mises test:

$$T_n = \int \left(F_n(x) - F_{\hat\theta}(x)\right)^2 dF_{\hat\theta}(x).$$

Density based tests

Use $f_{\hat\theta}$ for the chi-squared test:

$$T_n = n\sum_{i=1}^{k}\frac{\left[\int_{I_i}(\hat f_H - f_{\hat\theta})\right]^2}{\int_{I_i} f_{\hat\theta}} \simeq n\left(\int\frac{\hat f_H^{\,2}}{f_{\hat\theta}} - 1\right).$$

Distance based tests

$$T_n = d(\hat f, f_{\hat\theta}).$$

A classical example: density estimation

[Figure: density estimate of the waiting times; horizontal axis "Waiting times" (40 to 100), vertical axis "Density".]

Old Faithful geyser dataset: waiting times between eruptions.

Kernel density estimator: Gaussian kernel; CV bandwidth.
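A kernel density estimator with a Gaussian kernel and a cross-validation bandwidth can be sketched as follows. The slides do not say which cross-validation criterion was used; least-squares cross-validation is assumed here. The data are synthetic bimodal values generated only for illustration (they are not the Old Faithful measurements), and all function names are hypothetical.

```python
import math
import random

def normal_pdf(u, s):
    """Density of N(0, s^2) evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    return sum(normal_pdf(x - xi, h) for xi in data) / len(data)

def lscv(data, h):
    """Least-squares CV score: int fhat^2 - (2/n) sum_i fhat_{-i}(X_i)."""
    n = len(data)
    # For a Gaussian kernel, int fhat^2 is a double sum of N(0, 2 h^2) densities.
    int_f2 = sum(normal_pdf(a - b, h * math.sqrt(2.0)) for a in data for b in data) / n ** 2
    leave_one_out = sum(
        normal_pdf(a - b, h)
        for i, a in enumerate(data) for j, b in enumerate(data) if i != j
    ) / (n * (n - 1))
    return int_f2 - 2.0 * leave_one_out

def cv_bandwidth(data, grid):
    """Bandwidth in the candidate grid with the smallest LSCV score."""
    return min(grid, key=lambda h: lscv(data, h))

rng = random.Random(0)
# Synthetic two-mode sample standing in for waiting times (assumption).
waits = [rng.gauss(54, 6) if rng.random() < 0.35 else rng.gauss(80, 6) for _ in range(100)]
h_grid = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]
h_cv = cv_bandwidth(waits, h_grid)
print("selected bandwidth:", h_cv)
```

Searching over a small grid keeps the sketch short; a finer grid or a numerical optimizer would be the natural refinement.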

Density estimation

[Figure: kernel density estimates on [-4, 4] (vertical axis "Density"), with curves labelled "optimal", "oversmoothing" and "undersmoothing".]

Asymptotic structure:

$$h^{-1/2}\left(nh\, d(\hat f, f_{\hat\theta}) - \int f_{\hat\theta}(x)\,\omega(x)\,dx \int K^2(x)\,dx\right) \longrightarrow N\left(0,\ 2\int (K * K)^2(x)\,dx \int \omega^2(x)\, f_\theta^2(x)\,dx\right).$$

Some references

Bickel and Rosenblatt (1973)
Ahmad and Cerrito (1993)
Fan (1994, 1998)
Gouriéroux and Tenreiro (2001)
Neumann and Paparoditis (2000)
Lee and Na (2002)
Giné and Mason (2004)
Chebana (2004)
Cao and Lugosi (2005)
Bachmann and Dette (2005)
Chebana (2006)
Liang and King (2007)
Tenreiro (2007, 2009)


Tests based on the estimation of the regression function.


Regression model

$$Y_i = m(X_i) + \varepsilon_i, \quad i = 1, \dots, n$$

Fixed design: $X_i = x_i$ (fixed points) for $i = 1, 2, \dots, n$, with $Var(Y_i) = Var(\varepsilon_i) = \sigma^2(x_i)$.

Random design: $\{(X_i, Y_i)\}_{i=1}^{n}$ is a random sample from the $(X, Y)$ population, with $Var(Y_i \mid X_i) = Var(\varepsilon_i \mid X_i) = \sigma^2(X_i)$.

Basic problem

Test $H_0$: $m \in \{m_\theta;\ \theta \in \Theta\}$.

Use some nonparametric estimator of $m$,

$$\hat m(x) = \sum_{j=1}^{n} W_{nj}(x)\, Y_j,$$

and compare it with some parametric estimator, $m_{\hat\theta}$.

For random design, the estimator proposed by Nadaraya (1964) and Watson (1964):

$$\hat m(x) = \frac{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)},$$

where $K$ is the kernel function and $h$ is the bandwidth parameter.
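The Nadaraya-Watson estimator is a ratio of two kernel-weighted sums, which the following short Python sketch makes explicit. The Gaussian kernel and the function names are illustrative assumptions; any kernel function of one argument can be passed instead.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (illustrative choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nadaraya_watson(x, xs, ys, h, kernel=gaussian_kernel):
    """Nadaraya-Watson estimate of m(x) from the sample (xs, ys) with bandwidth h."""
    weights = [kernel((x - xi) / h) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        # Can happen with compactly supported kernels far away from the data.
        raise ValueError("no observation receives positive weight at x")
    return sum(w * y for w, y in zip(weights, ys)) / total

# With a design symmetric around x = 1 and linear responses, the estimate at 1 is 1.
print(nadaraya_watson(1.0, [0.0, 1.0, 2.0], [0.0, 1.0, 2.0], h=0.5))
```

Dividing each kernel value by the total gives the weights $W_{ni}(x)$ of the general linear form $\hat m(x) = \sum_i W_{ni}(x) Y_i$.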

The local polynomial regression (see, for instance, Fan and Gijbels (1996)), also quite popular in random design:

$$\hat m(x) = \hat\beta_0(x) = \sum_{i=1}^{n} W_{n,q}\left(\frac{x - X_i}{h}\right) Y_i,$$

where $\hat\beta(x) = (\hat\beta_0(x), \dots, \hat\beta_q(x))^t$ is the minimizer of

$$\sum_{i=1}^{n}\left(Y_i - \sum_{r=0}^{q}\beta_r\,(x - X_i)^r\right)^2 K\left(\frac{x - X_i}{h}\right),$$

and

$$W_{n,q}(t) = u^t\,(X^t W X)^{-1}\,(1, ht, \dots, h^q t^q)^t\,\frac{K(t)}{h},$$

with $u^t = (1, 0, \dots, 0) \in \mathbb{R}^{q+1}$, $X = \left((x - X_i)^j\right)_{1 \le i \le n,\ 0 \le j \le q}$ and $W = \mathrm{diag}\left(K\left(\frac{x - X_i}{h}\right)\right)$.

For the fixed design case, the Priestley-Chao estimator (Priestley and Chao (1972)):

$$\hat m(x) = \sum_{i=1}^{n}\frac{1}{h}\int_{s_{i-1}}^{s_i} K\left(\frac{x - u}{h}\right) du\; Y_i,$$

with $s_0 = 0$, $s_{i-1} \le x_i \le s_i$, $i = 1, \dots, n$, and $s_n = 1$.

In general, all these estimators can be written as:

$$\hat m(x) = \sum_{i=1}^{n} W_{ni}(x)\, Y_i.$$

Let us consider the linear regression case

$$H_0: m = m_\theta(\cdot) = A^t(\cdot)\,\theta, \text{ for some } \theta.$$

The least squares estimator is

$$\hat\theta = (AA^t)^{-1} A Y,$$

where

$$A_{q \times n} = \left(A(X_1)\ A(X_2)\ \cdots\ A(X_n)\right), \qquad Y_{n \times 1} = (Y_1, Y_2, \dots, Y_n)^t.$$

Approach 1: Smoothing the data.

$$d(\hat m, H_0) = \frac{1}{n}\sum_{i=1}^{n}\left(\hat m(X_i) - m_{\hat\theta}(X_i)\right)^2 = \frac{1}{n} RSS(HY, PY),$$

where

$$(\hat m(X_1), \hat m(X_2), \dots, \hat m(X_n))^t = HY, \quad\text{with } H = (W_{nj}(X_i))_{i,j=1}^{n},$$

and

$$(m_{\hat\theta}(X_1), \dots, m_{\hat\theta}(X_n))^t = (A^t(X_1)\hat\theta, \dots, A^t(X_n)\hat\theta)^t = A^t\hat\theta = A^t(AA^t)^{-1}AY = PY,$$

with $P = A^t(AA^t)^{-1}A$.

Approach 2: Smoothing the data and the hypothesis.

$$d(\hat m, H_0) = \frac{1}{n}\sum_{i=1}^{n}\left(\hat m(X_i) - \hat m_{\hat\theta}(X_i)\right)^2 = \frac{1}{n} RSS(HY, HPY),$$

with

$$\hat m_{\hat\theta}(x) = \sum_{j=1}^{n} W_{nj}(x)\, m_{\hat\theta}(X_j)$$

and

$$(\hat m_{\hat\theta}(X_1), \hat m_{\hat\theta}(X_2), \dots, \hat m_{\hat\theta}(X_n))^t = HPY.$$
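The difference between the two approaches can be seen numerically. The sketch below assumes the simplest linear model, $A(x) = (1, x)^t$, and Nadaraya-Watson weights with a Gaussian kernel for $H$; both choices and all function names are illustrative. It computes $\frac{1}{n}RSS(HY, PY)$ and $\frac{1}{n}RSS(HY, HPY)$ without forming the matrices, since $PY$ is the vector of least squares fitted values and $H$ acts on a vector by smoothing it.

```python
import math

def nw_weights(x, xs, h):
    """Row of H at x: normalized Gaussian kernel weights W_nj(x)."""
    k = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
    total = sum(k)
    return [v / total for v in k]

def smooth(xs, values, h):
    """Apply H to a vector: kernel-weighted averages at every design point."""
    return [sum(w * v for w, v in zip(nw_weights(x, xs, h), values)) for x in xs]

def ols_line_fit(xs, ys):
    """PY for A(x) = (1, x)^t: fitted values of the least squares line."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
    return [ybar + slope * (x - xbar) for x in xs]

def distances(xs, ys, h):
    """Return (d1, d2) = (RSS(HY, PY) / n, RSS(HY, HPY) / n)."""
    n = len(xs)
    hy = smooth(xs, ys, h)
    py = ols_line_fit(xs, ys)
    hpy = smooth(xs, py, h)
    d1 = sum((a - b) ** 2 for a, b in zip(hy, py)) / n
    d2 = sum((a - b) ** 2 for a, b in zip(hy, hpy)) / n
    return d1, d2

# Noise-free data that satisfy H0 exactly.
xs = [i / 50 for i in range(51)]
ys = [1.0 + 2.0 * x for x in xs]
d1, d2 = distances(xs, ys, h=0.1)
print(d1, d2)
```

On data that satisfy the null hypothesis exactly, the first distance stays positive because the smoother is biased near the boundary, while the second is zero up to rounding: smoothing the hypothesis as well cancels the smoothing bias.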

Some references

Eubank and Spiegelman (1990)
Raz (1990)
Kozek (1991)
Staniswalis and Severini (1991)
Müller (1992)
Hart and Wehrly (1992)
Eubank and Hart (1992)
Wooldridge (1992)
Eubank and Hart (1993)
Eubank and LaRiccia (1993)
González-Manteiga and Cao (1993)
Härdle and Mammen (1993)
Samarov (1993)
Fan and Li (1996)
Spokoiny (1996)
Stute and González-Manteiga (1996)
Härdle, Spokoiny and Sperlich (1997)
Hart (1997)
Härdle, Mammen and Müller (1998)
Rodríguez-Campos, González-Manteiga and Cao (1998)
Alcalá, Cristóbal and González-Manteiga (1999)
Dette (1999)
Härdle and Kneip (1999)
Aerts, Claeskens and Hart (2000)
Ramil-Novo and González-Manteiga (2000)
Biederman and Dette (2001)
Fan, Zhang and Zhang (2001)
Fan and Linton (2002)
Miles and Mora (2003)
Zhang and Dette (2003)
Koul and Ni (2004)
Xia, Tong and Zhang (2004)
Zhang (2003, 2004)
Eubank, Li and Wang (2005)
Guerre and Lavergne (2005)
Hall and Yatchew (2005)
Zhu (2005)
Fan, Li and Min (2006)
Gao (2007)
Hsiao, Li and Racine (2007)
Koul and Song (2008)

An example for fixed design.

As in González-Manteiga and Cao (1993), consider the case of testing a linear regression ($m_\theta(\cdot) = A^t(\cdot)\,\theta$), assuming homoscedasticity ($Var(\varepsilon_i) = \sigma^2$).

$x_i = i/n$, $i = 1, 2, \dots, n$: fixed equispaced design in $[0, 1]$.

$\hat m_h(x) = \sum_{i=1}^{n} W_{ni}(x)\, Y_i$, with Gasser-Müller weights:

$$W_{ni}(x) = h^{-1}\int_{s_{i-1}}^{s_i} K\left(\frac{x - s}{h}\right) ds,$$

where $s_0 = 0$, $s_{i-1} \le x_i \le s_i$, $i = 1, 2, \dots, n$, and $s_n = 1$.

Assume that $H_0$ holds and call $\theta_0$ the true value of the parameter. The test statistic is based on $\Delta ASE = ASE_{nonpar} - ASE_{par}$, where

$$ASE_{par} = \frac{1}{n}\sum_{i=1}^{n}\left(m_{\theta_0}(x_i) - m_{\hat\theta}(x_i)\right)^2,$$

$$ASE_{nonpar} = \frac{1}{n}\sum_{i=1}^{n}\left(m_{\theta_0}(x_i) - \hat m_h(x_i)\right)^2,$$

and $\hat\theta$ is chosen by minimum distance:

$$\hat\theta := \arg\min_{\theta \in \Theta}\sum_{i=1}^{n}\left(\hat m_h(x_i) - m_\theta(x_i)\right)^2.$$

This leads to

$$\Delta ASE = \frac{1}{n}\sum_{i=1}^{n}\left(m_{\hat\theta}(x_i) - \hat m_h(x_i)\right)^2.$$

The test statistic:

$$T_n = \left(2\sigma^4\int (K * K)^2\right)^{-1/2} \times \left((n^2 h)^{1/2}\,\Delta ASE - h^{-1/2}\sigma^2\int K^2\right)$$

$$\simeq \left(2\,\mathrm{trace}\left((H^t H)^2\right)\right)^{-1/2} \times \left(\sigma^{-2} RSS(HY, PHY) - \mathrm{trace}(H^t H)\right).$$

Under $H_0$: $m(x) = A^t(x)\,\theta$ (with $nh^4 \to 0$, $nh^2 \to \infty$, and regularity and moment conditions),

$$T_n \xrightarrow{d} N(0, 1).$$

So, reject $H_0$ if $T_n > z_\alpha$.

Under local alternatives of the form

$$m(x) = A^t(x)\,\theta + c_n\, g(x),$$

with $g(x)$ orthogonal to $A^t(x)\,\theta$ and $c_n = n^{-1/2} h^{-1/4}$,

$$T_n \xrightarrow{d} N\left(\frac{\int g^2}{\left(2\int (K * K)^2\right)^{1/2}\sigma^2},\ 1\right).$$

An alternative approximation by Ramil-Novo and González-Manteiga (1998) is

$$\sup_x\left|P(T_n \le x) - P\left((2v)^{-1/2}(\chi^2_v - v) \le x\right)\right| \le c \cdot h,$$

where $c$ is a constant and $v = \dfrac{\left(\mathrm{trace}(H^4)\right)^3}{\left(\mathrm{trace}(H^6)\right)^2}$.

Correlated errors

Generally speaking, the theory remains the same after replacing $\sigma^2$ by $\sum_{k=-\infty}^{\infty}\gamma(k)$, where $\gamma(k) = Cov(\varepsilon_t, \varepsilon_{t+k})$. See González-Manteiga and Vilar-Fernández (1995, 1996, 2000, 2004) and Biederman and Dette (2000).

An example for random design.

We follow Alcalá, Cristóbal and González-Manteiga (1999), who test the polynomial hypothesis

$$H_0: m(x) = \sum_{j=1}^{q}\theta_j\, x^{j-1}.$$

X lies in a compact set with probability one and its density, $f$, is bounded away from zero.
$m, f \in C^2$, $w \in C$.
$\sigma^2(x) = Var(Y \mid X = x)$ is bounded, bounded away from zero and continuous.
$K$ is a symmetric density with compact support.

$\hat m$ is the local polynomial estimator:

$$\hat m(x) = \hat\beta_0(x) = \sum_{i=1}^{n} W_{n,q}\left(\frac{X_i - x}{h}\right) Y_i,$$

where $\hat\beta(x) = (\hat\beta_0(x), \dots, \hat\beta_q(x))^t$ is the minimizer of

$$\sum_{i=1}^{n}\left(Y_i - \sum_{r=0}^{q}\beta_r(x)\,(X_i - x)^r\right)^2 K\left(\frac{X_i - x}{h}\right),$$

and

$$W_{n,q}(t) = u^t\,(X^t W X)^{-1}\,(1, ht, \dots, h^q t^q)^t\,\frac{K(t)}{h},$$

with $u^t = (1, 0, \dots, 0) \in \mathbb{R}^{q+1}$, $X = \left((X_i - x)^j\right)_{1 \le i \le n,\ 0 \le j \le q}$ and $W = \mathrm{diag}\left(K\left(\frac{X_i - x}{h}\right)\right)$.

The test statistic is

$$T_n = \left[2h^{-1}\int (K_q * K_q)^2(t)\,dt \int\left(\frac{\sigma^2 w}{f}\right)^2\right]^{-1/2} \times \left(n\, d^2(\hat m, m_{\hat\theta}) - h^{-1}\int K_q^2 \int\frac{\sigma^2 w}{f}\right),$$

with

$$d^2(\hat m, m_{\hat\theta}) = \int (\hat m - m_{\hat\theta})^2\, w,$$

which coincides with $\int (\hat m - \hat m_{\hat\theta})^2\, w$ when the degree of the local polynomial fit exceeds $q$, and $K_q$ is the equivalent kernel corresponding to $W_{n,q}$.

Under $H_0$,

$$T_n \xrightarrow{d} N(0, 1).$$

For local alternatives

$$m(x) = \sum_{j=1}^{q}\theta_j\, x^{j-1} + c_n\, g(x),$$

with $g(x)$ orthogonal to $\sum_{j=1}^{q}\theta_j x^{j-1}$ and $c_n = n^{-1/2} h^{-1/4}$,

$$T_n \xrightarrow{d} N\left(\frac{\int g^2 w}{\left(2\int (K_q * K_q)^2(t)\,dt \int\left(\frac{\sigma^2 w}{f}\right)^2\right)^{1/2}},\ 1\right).$$

Drawbacks of the smoothing approach

1. Bandwidth choice.
2. Slow rate of convergence of $T_n$ to its normal limit.
3. Unknown curves involved in the test statistic require estimation.

The bootstrap approximation seems a reasonable alternative. See Härdle and Mammen (1993), Stute and González-Manteiga (1996) and Alcalá, Cristóbal and González-Manteiga (1999).
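A bootstrap calibration of a smoothing-based statistic can be sketched as follows. This is an illustration under explicit assumptions, not the procedure of any of the cited papers: the statistic is $\frac{1}{n}RSS(HY, HPY)$ for a straight-line null model, $H$ holds Nadaraya-Watson weights with a Gaussian kernel, and the resampling is a wild bootstrap with a two-point multiplier of mean 0 and variance 1. All function names are hypothetical.

```python
import math
import random

def nw_matrix(xs, h):
    """Smoother matrix H of Nadaraya-Watson weights (Gaussian kernel, assumption)."""
    rows = []
    for x in xs:
        k = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
        total = sum(k)
        rows.append([v / total for v in k])
    return rows

def ols_line_fit(xs, ys):
    """Fitted values of the least squares line (the null model here)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
    return [ybar + slope * (x - xbar) for x in xs]

def statistic(hmat, xs, ys):
    """(1/n) RSS(HY, HPY), computed as the mean square of H applied to Y - PY."""
    resid = [y - f for y, f in zip(ys, ols_line_fit(xs, ys))]
    smoothed = [sum(w * r for w, r in zip(row, resid)) for row in hmat]
    return sum(s * s for s in smoothed) / len(xs)

def wild_bootstrap_pvalue(xs, ys, h, n_boot=99, rng=None):
    """Bootstrap p-value: share of resampled statistics at least as large as the observed one."""
    rng = rng or random.Random(0)
    hmat = nw_matrix(xs, h)
    t_obs = statistic(hmat, xs, ys)
    fitted = ols_line_fit(xs, ys)
    resid = [y - f for y, f in zip(ys, fitted)]
    # Two-point multiplier with mean 0 and variance 1 (assumed choice).
    root5 = math.sqrt(5.0)
    low, high, p_low = (1 - root5) / 2, (1 + root5) / 2, (5 + root5) / 10
    exceed = 0
    for _ in range(n_boot):
        ystar = [f + r * (low if rng.random() < p_low else high) for f, r in zip(fitted, resid)]
        if statistic(hmat, xs, ystar) >= t_obs:
            exceed += 1
    return (1 + exceed) / (n_boot + 1)

rng = random.Random(1)
xs = [rng.random() for _ in range(80)]
ys = [2 * x * x - 5 * x + math.cos(2 * math.pi * x) + rng.gauss(0, 0.5) for x in xs]
p_value = wild_bootstrap_pvalue(xs, ys, h=0.1, rng=rng)
print("bootstrap p-value:", p_value)
```

The resamples are generated from the fitted null model, so the bootstrap distribution mimics the behaviour of the statistic under $H_0$ for the bandwidth actually used, without estimating the unknown curves in the normal limit.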

The generalized likelihood ratio test.

Consider the model

$$Y_i = m(X_i) + \varepsilon_i, \quad i = 1, \dots, n,$$

where $\{\varepsilon_i\}$ is a sequence of i.i.d. $N(0, \sigma^2)$ random variables and the $X_i$ have a density with support in $[0, 1]$. Assume that:

$$\mathcal{M} = \left\{m \in L^2[0, 1];\ \int (m^{(k)}(x))^2\,dx \le c\right\}.$$

Set the testing problem:

$$H_0: m(x) = \theta_0 + \theta_1 x \quad\text{vs.}\quad H_a: m(x) \neq \theta_0 + \theta_1 x.$$

The loglikelihood associated with the previous model is given by:

$$l(m, \sigma) = -n\log\left(\sqrt{2\pi\sigma^2}\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - m(X_i))^2.$$

Denote by $\hat\theta_0$ and $\hat\theta_1$ the maximum likelihood estimators (MLE) under $H_0$ and by $\hat m_{MLE}$ the MLE under $\mathcal{M}$. This estimator is the one that minimizes:

$$\sum_{i=1}^{n}(Y_i - m(X_i))^2 \quad\text{subject to}\quad \int (m^{(k)}(x))^2\,dx \le c.$$

Then, $\hat m_{MLE}$ is the smoothing spline with smoothing parameter such that $\|\hat m^{(k)}_{MLE}\|^2 = c$. Therefore,

$$RSS_0 = \sum_{i=1}^{n}(Y_i - \hat\theta_0 - \hat\theta_1 X_i)^2, \qquad RSS_1 = \sum_{i=1}^{n}(Y_i - \hat m_{MLE}(X_i))^2.$$

Besides,

$$\lambda_n = l(\hat m_{MLE}, \hat\sigma) - l(\hat m_0, \hat\sigma_0) = \frac{n}{2}\log\frac{RSS_0}{RSS_1},$$

with $\hat\sigma^2 = RSS_1/n$, $\hat\sigma_0^2 = RSS_0/n$ and $\hat m_0(x) = \hat\theta_0 + \hat\theta_1 x$.
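The expression for $\lambda_n$ follows from profiling out $\sigma$ in the Gaussian loglikelihood. A short derivation, writing $RSS(m) = \sum_{i=1}^{n}(Y_i - m(X_i))^2$ (a shorthand introduced here only for this step):

$$l(m, \sigma) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{RSS(m)}{2\sigma^2}, \qquad \frac{\partial l}{\partial\sigma^2} = 0 \iff \hat\sigma^2 = \frac{RSS(m)}{n},$$

$$l(m, \hat\sigma) = -\frac{n}{2}\log\left(\frac{2\pi\, RSS(m)}{n}\right) - \frac{n}{2} = -\frac{n}{2}\log RSS(m) - \frac{n}{2}\left(1 + \log\frac{2\pi}{n}\right).$$

The second term does not depend on $m$, so it cancels in the difference of the two profiled loglikelihoods, leaving $\frac{n}{2}\left(\log RSS_0 - \log RSS_1\right) = \frac{n}{2}\log\frac{RSS_0}{RSS_1}$.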

Although in this particular situation the MLE exists under $\mathcal{M}$, the constant $c$ is unknown and, in many situations, $\hat m_{MLE}$ may not exist.

The generalized likelihood ratio test (GLRT) considers an estimator under $\mathcal{M}$ which may not coincide with the MLE, for instance, the local linear fit $\hat m_h$. In this way:

$$l(\hat m_h, \hat\sigma) = -\frac{n}{2}\log(RSS_1) - \frac{n}{2}\left(1 + \log\frac{2\pi}{n}\right),$$

$$l(\hat m_0, \hat\sigma_0) = -\frac{n}{2}\log(RSS_0) - \frac{n}{2}\left(1 + \log\frac{2\pi}{n}\right),$$

and the GLRT statistic is given by:

$$\lambda_n = l(\hat m_h, \hat\sigma) - l(\hat m_0, \hat\sigma_0) = \frac{n}{2}\log\frac{RSS_0}{RSS_1},$$

where

$$RSS_1 = \sum_{i=1}^{n}(Y_i - \hat m_h(X_i))^2.$$
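The GLRT statistic only needs the two residual sums of squares. The following sketch computes $\lambda_n$ with a local linear fit under the alternative; the Gaussian kernel, the bandwidth value and the function names are assumptions made for illustration, and the data are simulated from the model of the simulated example in the Motivation section with a smaller sample size.

```python
import math
import random

def local_linear(x, xs, ys, h):
    """Local linear estimate at x: intercept of a kernel-weighted least squares line."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(xs, ys):
        d = xi - x
        w = math.exp(-0.5 * (d / h) ** 2)  # Gaussian kernel weight (assumption)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yi
        t1 += w * d * yi
    # Solve the 2x2 weighted normal equations for the intercept.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)

def rss_linear(xs, ys):
    """RSS_0: residual sum of squares of the least squares line."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
    return sum((y - ybar - slope * (x - xbar)) ** 2 for x, y in zip(xs, ys))

def glrt_statistic(xs, ys, h):
    """lambda_n = (n / 2) * log(RSS_0 / RSS_1) with a local linear fit for RSS_1."""
    n = len(xs)
    rss0 = rss_linear(xs, ys)
    rss1 = sum((y - local_linear(x, xs, ys, h)) ** 2 for x, y in zip(xs, ys))
    return 0.5 * n * math.log(rss0 / rss1)

rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [2 * x * x - 5 * x + math.cos(2 * math.pi * x) + rng.gauss(0, 1) for x in xs]
lam = glrt_statistic(xs, ys, h=0.1)
print("GLRT statistic:", lam)
```

Because the local linear fit is not a maximizer of the likelihood, nothing forces $RSS_1 \le RSS_0$ in general; for clearly nonlinear data such as these the ratio is well above one and $\lambda_n$ is large.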

Under some regularity conditions, Fan et al. (2001) prove that:

$$r_k\lambda_n \sim \chi^2_{\nu_n}, \qquad \nu_n = \frac{r_k\, c_k\, |\Omega|}{h},$$

where $|\Omega|$ is the measure of the support of $X$, $r_k = c_k/d_k$, $c_k = K(0) - \frac{1}{2}\|K\|^2$ and $d_k = \|K - \frac{1}{2}K * K\|^2$. In Fan et al. (2001) and Fan and Jiang (2007), these ideas are applied to more complex contexts.
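The constants $c_k$, $d_k$ and $r_k$ depend only on the kernel and can be approximated numerically. The sketch below does so with Riemann sums for the Epanechnikov kernel; the choice of kernel, the grid step and the function names are assumptions, and the code presumes a kernel supported on $[-1, 1]$.

```python
def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1] (illustrative choice of K)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def glrt_constants(kernel, step=0.01):
    """Approximate (c_k, d_k, r_k) for a kernel supported on [-1, 1]."""
    half = int(round(1.0 / step))
    v_grid = [j * step for j in range(-half, half + 1)]          # support of K
    u_grid = [i * step for i in range(-2 * half, 2 * half + 1)]  # support of K * K

    def conv(u):
        # (K * K)(u) = int K(v) K(u - v) dv, approximated by a Riemann sum.
        return step * sum(kernel(v) * kernel(u - v) for v in v_grid)

    k_norm2 = step * sum(kernel(v) ** 2 for v in v_grid)  # ||K||^2
    c_k = kernel(0.0) - 0.5 * k_norm2
    d_k = step * sum((kernel(u) - 0.5 * conv(u)) ** 2 for u in u_grid)
    return c_k, d_k, c_k / d_k

c_k, d_k, r_k = glrt_constants(epanechnikov)
nu_n = r_k * c_k * 1.0 / 0.1  # degrees of freedom for |Omega| = 1 and h = 0.1
print(c_k, d_k, r_k, nu_n)
```

For this kernel $K(0) = 0.75$ and $\|K\|^2 = 0.6$, so $c_k = 0.45$ exactly; the numerical value of $r_k$ is close to 2.115, which gives about 9.5 degrees of freedom for $|\Omega| = 1$ and $h = 0.1$.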

Tests based on the empirical distribution of the residuals.

Location-scale regression model

Assume that the regression model can be written in a location-scale form as

$$Y = m(X) + \sigma(X)\,\varepsilon,$$

with $\varepsilon$ independent of $X$ and with error distribution $F_\varepsilon(y) = P(\varepsilon \le y) = P\left(\frac{Y - m(X)}{\sigma(X)} \le y\right)$.

If $\theta_0$ denotes the argument that minimizes $E\left((m(X) - m_\theta(X))^2\right)$ over the parameter set $\Theta \subset \mathbb{R}^q$, then $m_{\theta_0}$ is the parametric model with minimum distance to $m$, and the error distribution under this model is built as

$$F_{\varepsilon_0}(y) = P(\varepsilon_0 \le y) = P\left(\frac{Y - m_{\theta_0}(X)}{\sigma(X)} \le y\right).$$

Hence, the null hypothesis $H_0$: $m \in \mathcal{M}_\theta$ is true if and only if the error distributions $F_\varepsilon$ and $F_{\varepsilon_0}$ are the same.

The process

This result opens a way for GoF tests based on continuous functionals of the process $\hat F_\varepsilon(\cdot) - \hat F_{\varepsilon_0}(\cdot)$, where the estimators of the error distribution can be given by

$$\hat F_\varepsilon(y) = \frac{1}{n}\sum_{i=1}^{n} I\left(\frac{Y_i - \hat m_{nh}(X_i)}{\hat\sigma(X_i)} \le y\right) = \frac{1}{n}\sum_{i=1}^{n} I(\hat\varepsilon_i \le y)$$

and

$$\hat F_{\varepsilon_0}(y) = \frac{1}{n}\sum_{i=1}^{n} I\left(\frac{Y_i - m_{\hat\theta}(X_i)}{\hat\sigma(X_i)} \le y\right) = \frac{1}{n}\sum_{i=1}^{n} I(\hat\varepsilon_{i0} \le y).$$

The variance estimator is given by

$$\hat\sigma^2(x) = \sum_{i=1}^{n} W_{ni}(x)\, Y_i^2 - \hat m_{nh}^2(x),$$

where $\{W_{ni}\}_{i=1}^{n}$ is a sequence of Nadaraya-Watson weights and $\hat\theta$ is a least squares estimator.

See Van Keilegom et al. (2008) and Khmaladze and Koul (2009) for $p = 1$ (one-dimensional covariate) and Neumeyer (2009) and Neumeyer and Van Keilegom (2010) for $p \ge 1$.

Page 55: Departamento de Estad´ıstica e Investigaci´on Operativa …eio.usc.es/eipc1/base/BASEMASTER/FORMULARIOS-PHP... · 2014. 2. 12. · Gouri´eroux and Tenreiro (2001) Neumann and

GoF for regression models

Tests based on the estimation of the regression function.

Tests based on the empirical distribution of the residuals.

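Based on the empirical distribution of the residuals, the Kolmogorov–Smirnov and Cramér–von Mises tests are given by:

$$T_{nKS} = n^{1/2}\sup_{y\in\mathbb{R}}\big|\hat F_\varepsilon(y) - \hat F_{\varepsilon 0}(y)\big| \quad\text{and}\quad T_{nCM} = n\int\big(\hat F_\varepsilon(y) - \hat F_{\varepsilon 0}(y)\big)^2\,d\hat F_{\varepsilon 0}(y).$$

From this methodology, a test for the error distribution can also be constructed, without further assumptions on $m$ and $\sigma$, by simply comparing the empirical distribution of the residuals $\{\hat\varepsilon_i\}_{i=1}^n$ with the one estimated under $H_0: F_\varepsilon \in \mathcal{F}_\theta$.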
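As a rough illustration of the two statistics above, the following Python sketch computes $T_{nKS}$ and $T_{nCM}$ for a scalar covariate. Several ingredients are assumptions of the sketch and not part of the slides: a Gaussian kernel for the Nadaraya–Watson weights, a straight-line null model fitted by least squares, the empirical distribution of the parametric residuals as integrator of the Cramér–von Mises statistic, and the helper names `nw_weights` and `residual_edf_statistics`.

```python
import numpy as np

def nw_weights(x_eval, x, h):
    """Nadaraya-Watson weights with a Gaussian kernel; each row sums to one."""
    k = np.exp(-0.5 * ((x_eval[:, None] - x[None, :]) / h) ** 2)
    return k / k.sum(axis=1, keepdims=True)

def residual_edf_statistics(x, y, h):
    """KS and CvM distances between the two residual-based edfs.

    Assumption of this sketch: the null model is a straight line.
    """
    n = len(x)
    w = nw_weights(x, x, h)
    m_np = w @ y                                              # nonparametric fit
    sigma = np.sqrt(np.maximum(w @ y**2 - m_np**2, 1e-12))    # variance estimator
    design = np.column_stack([np.ones(n), x])
    theta, *_ = np.linalg.lstsq(design, y, rcond=None)        # least squares
    eps = (y - m_np) / sigma                                  # nonparametric residuals
    eps0 = (y - design @ theta) / sigma                       # parametric residuals
    eps_sorted, eps0_sorted = np.sort(eps), np.sort(eps0)

    def edf_gap(points):
        f = np.searchsorted(eps_sorted, points, side="right") / n
        f0 = np.searchsorted(eps0_sorted, points, side="right") / n
        return f - f0

    # Both edfs are step functions, so the supremum is attained at a jump point.
    t_ks = np.sqrt(n) * np.max(np.abs(edf_gap(np.concatenate([eps, eps0]))))
    # Integrating against the edf of eps0 is an average over the eps0 points.
    t_cm = n * np.mean(edf_gap(eps0) ** 2)
    return t_ks, t_cm
```

Critical values are not computed here; in practice they would come from a resampling scheme or from the asymptotic theory in the cited papers.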


Tests designed for avoiding the curse of dimensionality.

A great deal of the theory developed during the nineties considers test statistics constructed from the comparison of a nonparametric estimator of the regression model and an estimator under the null hypothesis (that is, based on the $\alpha_n$ process), or from the corresponding comparison of the integrated regression function estimators (based on the $\alpha_n$ process). In both cases, the curse of dimensionality becomes apparent as $p$, the dimension of the explanatory variable, increases.


The aforementioned difficulties have led to different modifications of the previous methods in order to avoid the curse of dimensionality. For the tests based on smoothing methods, corresponding to the process $\alpha_n$, the works by Lavergne and Patilea (2008) and Xia (2009) are noteworthy.


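Inspired by projection pursuit ideas, the null hypothesis $H_0: m \in \mathcal{M}_\theta$ is true if and only if $m = m_{\theta_0} \in \mathcal{M}_\theta$, and this is also equivalent to $E(\varepsilon|X) = E(\varepsilon_0|X) = E(Y - m_{\theta_0}(X)|X) = 0$. In addition, this is also equivalent to:

$$\sup_{\beta,\ \|\beta\|=1}\ \sup_{\nu}\ \big|E(\varepsilon|\beta^t X = \nu)\big| = 0 \iff \sup_{\beta,\ \|\beta\|=1} E\big(\varepsilon\,E(\varepsilon|\beta^t X)\big) = 0$$

under some regularity conditions, and this allows for the construction of some tests, similar to Zheng's test (see Lavergne and Patilea, 2008), which adapted to this context is given by:

$$T_n = \sup_{\beta,\ \|\beta\|=1}\ \sum_{i<j} K_h\big(\beta^t(X_i - X_j)\big)\big(Y_i - m_{\hat\theta}(X_i)\big)\big(Y_j - m_{\hat\theta}(X_j)\big).$$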


Another interesting idea consists in projecting the covariate $X$ in the direction $\beta = \beta_0$, where $\beta_0$ (with $\|\beta_0\| = 1$) minimizes

$$E\big[(\varepsilon - E(\varepsilon|\beta^t X))^2\big] = E\big[(\varepsilon - m_\beta(X))^2\big],$$

the single-indexing procedure being obtained through the corresponding empirical counterparts (see Xia, 2009).


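This makes it possible to construct test statistics such as

$$T_n = \frac{1}{n}\sum_{j=1}^n \omega(X_j)\big(\hat\varepsilon_{j0} - \hat m^{j}_{\hat\beta_j}(X_j)\big)^2,$$

where

$$\hat\beta_j = \arg\min_{\beta,\ \|\beta\|=1}\ \sum_{i\neq j}\big(\hat\varepsilon_{i0} - \hat m^{j}_{\beta}(X_i)\big)^2, \quad j = 1,\dots,n,$$

with

$$\hat m^{j}_{\beta}(x) = \frac{1}{n\hat f^{j}_{\beta}(x)}\sum_{i\neq j} K_h\big(\beta^t(x - X_i)\big)\hat\varepsilon_{i0}, \qquad \hat f^{j}_{\beta}(x) = \frac{1}{n}\sum_{i\neq j} K_h\big(\beta^t(x - X_i)\big).$$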


Regarding the tests based on empirical regression processes, Stute et al. (2008) propose replacing the empirical process $\alpha_n$ by

$$\alpha_n^g(t) = n^{-1/2}\sum_{i=1}^n \big(g(X_i) - \bar g\big)\,I(\hat\varepsilon_{i0} \le t), \quad t \in \mathbb{R},$$

indexed unidimensionally in $t$, with $\bar g = n^{-1}\sum_{i=1}^n g(X_i)$. The key to the adequate behaviour of the tests based on $\alpha_n^g(t)$ lies in the fourth term of the asymptotic representation (see Stute et al., 2008).
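The process $\alpha_n^g$ only jumps at the residuals, so it can be evaluated with one sort and one cumulative sum. The sketch below is a minimal illustration; the function name `alpha_n_g` and the example choice of `g` are hypothetical, and residuals are assumed to have no ties.

```python
import numpy as np

def alpha_n_g(x, resid0, g):
    """Evaluate alpha_n^g at the ordered residuals.

    Returns (t, values), where t holds the sorted residuals and
    values[k] = n^{-1/2} * sum_i (g(X_i) - g_bar) * 1{resid0_i <= t[k]}.
    """
    gx = g(x)
    order = np.argsort(resid0, kind="stable")
    centred = (gx - gx.mean())[order]          # marks g(X_i) - g_bar, in residual order
    return resid0[order], np.cumsum(centred) / np.sqrt(len(resid0))
```

Because the marks are centred, the process returns to zero at the largest residual, which is a convenient sanity check on any implementation.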


Under the assumption that $\varepsilon$ is independent of $X$ in the regression model, this term is given by the empirical counterpart of

$$A = E\big[(g(X) - E(g(X)))\,H(t, X, \theta_0)\big]$$

with $H(t, x, \theta) = P(\varepsilon \le t + m_\theta(X) - m(X)\,|\,X = x)$ and $\theta_0$ defined in the previous section. If the null hypothesis $H_0: m \in \mathcal{M}_\theta$ does not hold, then $A \neq 0$, guaranteeing the power of the test for fixed alternatives. The selection of the function $g$ is also discussed in Stute et al. (2008) with the goal of maximizing power.


Other approaches

During the nineties, different alternative proposals were introduced in the statistical literature, assuming in general

$$H_0: \mathcal{M} = \{m_\theta(\cdot)\}.$$

Härdle and Mammen (1993) introduce a test based on:

$$d_1(\hat m, H_0) = \int \big(\hat m(x) - \tilde m_{\hat\theta}(x)\big)^2\,\pi_1(x)\,dx,$$

where $\tilde m_{\hat\theta}(x)$ is the local polynomial regression function computed from $\{(X_i, m_{\hat\theta}(X_i))\}_{i=1}^n$.

Note that using the Riemann approximation of the integral for $m_\theta(\cdot) = A(\cdot)^t\theta$, we obtain the $\Delta$ASE theory described in the previous section.


The idea of this test is based on

$$C_1 = E^2(\varepsilon_0|X)\,\pi_1(X), \qquad \varepsilon_0 = Y - m_{\theta_0}(X),$$

since $E(C_1) = 0$ if and only if $H_0$ holds.


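Zheng's test (1996) considers:

$$d_2(\hat m, H_0) = \frac{1}{n}\sum_{i\neq j} K_h(X_i - X_j)\big(Y_i - m_{\hat\theta}(X_i)\big)\big(Y_j - m_{\hat\theta}(X_j)\big)\,\pi_2(X_i).$$

This test is based on

$$C_2 = \varepsilon_0\,E(\varepsilon_0|X)\,f(X)\,\pi_2(X).$$

The expected value of $C_2$ is zero if and only if the null hypothesis holds. The sample version of $E(C_2)$ is given by:

$$\frac{1}{n}\sum_{i=1}^n \big(Y_i - m_{\hat\theta}(X_i)\big)\big(\hat m_h(X_i) - \tilde m_{\hat\theta}(X_i)\big)\,\hat f_h(X_i)\,\pi_2(X_i),$$

which coincides with $d_2(\hat m, H_0)/n$ apart from an additive term (the $i = j$ contribution).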
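A minimal sketch of the double sum $d_2$ for a scalar covariate follows. The Gaussian kernel, the default weight $\pi_2 \equiv 1$ and the function name `zheng_statistic` are assumptions made for the illustration; the residuals $Y_i - m_{\hat\theta}(X_i)$ are taken as given.

```python
import numpy as np

def zheng_statistic(x, resid, h, weight=None):
    """d_2 = (1/n) * sum_{i != j} K_h(X_i - X_j) e_i e_j pi_2(X_i)."""
    n = len(x)
    pi2 = np.ones(n) if weight is None else weight(x)
    u = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h)   # K_h(u) = K(u / h) / h
    np.fill_diagonal(k, 0.0)                               # drop the i == j terms
    return float((pi2 * resid) @ k @ resid) / n
```

Writing the sum as a quadratic form keeps the computation at $O(n^2)$ memory, which is fine for moderate $n$; for large samples a blocked or truncated-kernel version would be preferable.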


The tests proposed in Dette (1999) (see also Azzalini and Bowman (1993)) are based on statistics of this type:

$$d_3(\hat m, H_0) = \sum_{i=1}^n \big(Y_i - m_{\hat\theta}(X_i)\big)^2\pi_3(X_i) - \sum_{i=1}^n \big(Y_i - \hat m_h(X_i)\big)^2\pi_3(X_i).$$

This test is based on

$$C_3 = E\big(\varepsilon_0^2 - (\varepsilon_0 - E(\varepsilon_0|X))^2\big)\,\pi_3(X).$$

The expected value of this quantity is zero if and only if $H_0$ holds.


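The sample version of $E(C_3)$ is given by:

$$\frac{1}{n}\sum_{i=1}^n \Big(\big(Y_i - m_{\hat\theta}(X_i)\big)^2 - \big((Y_i - m_{\hat\theta}(X_i)) - (\hat m_h(X_i) - \tilde m_{\hat\theta}(X_i))\big)^2\Big)\pi_3(X_i)$$

$$= \frac{1}{n}\sum_{i=1}^n \Big(\big(Y_i - m_{\hat\theta}(X_i)\big)^2 - \big(Y_i - \hat m_h(X_i) - m_{\hat\theta}(X_i) + \tilde m_{\hat\theta}(X_i)\big)^2\Big)\pi_3(X_i),$$

which coincides with $d_3(\hat m, H_0)/n$ when $\tilde m_{\hat\theta}(X_i) = m_{\hat\theta}(X_i)$.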


This is true when $m_\theta(\cdot)$ is a polynomial of order $q$, with $p \ge q$, where $p$ denotes the order of the local polynomial fit.

Note that if $\pi_1 \equiv \pi_2 f \equiv \pi_3 \equiv c$, for $c > 0$, then

$$E(C_1) = E(C_2) = E(C_3) = c\,E\big(E^2(\varepsilon_0|X)\big).$$

In Zhang and Dette (2004), the authors provide an exhaustive study comparing these procedures, under the null hypothesis and under contiguous and fixed alternatives.


Bootstrap approximations.

Naive resampling of the residuals

Step 1. Consider the parametric residuals

$$\hat e_i = Y_i - m_{\hat\theta}(X_i), \quad i = 1, 2, \dots, n.$$

Step 2. Recenter the residuals:

$$\tilde e_i = \hat e_i - \bar e, \quad i = 1, 2, \dots, n, \quad\text{with } \bar e = \frac{1}{n}\sum_{i=1}^n \hat e_i.$$

Step 3. Draw the bootstrap errors, $\varepsilon_i^*$, from the empirical distribution function of the $\tilde e_i = \hat e_i - \bar e$, $i = 1, 2, \dots, n$.

Step 4. Define the bootstrap resample:

$$Y_i^* = m_{\hat\theta}(X_i) + \varepsilon_i^*, \quad i = 1, 2, \dots, n.$$


Step 5. Using the resample $\{(X_i, Y_i^*)\}_{i=1}^n$, compute $\hat m^*$ and $m_{\hat\theta^*}$, the distance between them, $d_2(\hat m^*, m_{\hat\theta^*})$, and the bootstrap version, $T_n^*$.

Step 6. Repeat steps 3-5 $B$ times and compute $c^*_{1-\alpha}$, the $\lceil B(1-\alpha)\rceil$-th order statistic of the $T_n^{*(j)}$, $j = 1, 2, \dots, B$.
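Steps 1 to 6 can be sketched as follows. The hooks `fit` and `statistic`, the straight-line null model in `fit_line`, and the smooth-versus-parametric distance `smooth_vs_line` (a stand-in for whichever distance $d$ is being calibrated) are all assumptions of this illustration, not part of the procedure as stated.

```python
import numpy as np

def residual_bootstrap_critical_value(x, y, fit, statistic, alpha=0.05, B=500, seed=0):
    """Naive residual bootstrap: fit(x, y) gives fitted values under H0,
    statistic(x, y) recomputes the test statistic on a (re)sample."""
    rng = np.random.default_rng(seed)
    fitted = fit(x, y)
    e = y - fitted                                    # Step 1: parametric residuals
    e_centred = e - e.mean()                          # Step 2: recentre
    t_star = np.empty(B)
    for b in range(B):
        eps_star = rng.choice(e_centred, size=len(y), replace=True)   # Step 3
        y_star = fitted + eps_star                                    # Step 4
        t_star[b] = statistic(x, y_star)                              # Step 5
    k = int(np.ceil(B * (1 - alpha)))                 # Step 6: order statistic index
    return np.sort(t_star)[k - 1]

def fit_line(x, y):
    """Least squares straight line (hypothetical null model)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef

def smooth_vs_line(x, y, h=0.1):
    """Mean squared gap between a Nadaraya-Watson smooth and the refitted line."""
    k = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    m_np = (k @ y) / k.sum(axis=1)
    return float(np.mean((m_np - fit_line(x, y)) ** 2))
```

The test would then reject when `smooth_vs_line(x, y)` exceeds `residual_bootstrap_critical_value(x, y, fit_line, smooth_vs_line)`. Note that the statistic refits the parametric model on every resample, as Step 5 requires.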


Wild resampling

Step 1. Construct the parametric residuals

$$\hat e_i = Y_i - m_{\hat\theta}(X_i), \quad i = 1, 2, \dots, n.$$

Step 2. Draw the bootstrap errors, $\varepsilon_i^*$, satisfying $E^*(\varepsilon_i^*) = 0$, $E^*(\varepsilon_i^{*2}) = \hat e_i^2$, $E^*(\varepsilon_i^{*3}) = \hat e_i^3$, $i = 1, 2, \dots, n$.

Step 3. The bootstrap resample is defined as before:

$$Y_i^* = m_{\hat\theta}(X_i) + \varepsilon_i^*, \quad i = 1, 2, \dots, n.$$

Step 4. The resample $\{(X_i, Y_i^*)\}_{i=1}^n$ is used to compute $T_n^*$, as in the previous procedure.

Step 5. After repeating the previous steps $B$ times, the $\lceil B(1-\alpha)\rceil$-th order statistic of the $T_n^*$-values is used as a level $\alpha$ critical value of the test.
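One common way to meet the three moment conditions in Step 2 is a two-point "golden section" law for a multiplier $V_i^*$, with $\varepsilon_i^* = \hat e_i V_i^*$. This particular choice is an illustration and is not prescribed above; the short check below verifies that it satisfies the conditions.

```latex
% Two-point multiplier (assumed choice):
P^*(V^* = a) = p, \qquad P^*(V^* = b) = 1 - p, \qquad
a = \frac{1-\sqrt5}{2}, \quad b = \frac{1+\sqrt5}{2}, \quad p = \frac{5+\sqrt5}{10}.

% First moment:
E^*(V^*) = b + (a-b)\,p = \frac{1+\sqrt5}{2} - \sqrt5\,\frac{5+\sqrt5}{10}
         = \frac{1+\sqrt5}{2} - \frac{\sqrt5+1}{2} = 0.

% a and b are the roots of v^2 = v + 1, so V^{*2} = V^* + 1 and V^{*3} = 2V^* + 1:
E^*(V^{*2}) = E^*(V^*) + 1 = 1, \qquad E^*(V^{*3}) = 2E^*(V^*) + 1 = 1.

% Hence, for \varepsilon_i^* = \hat e_i V_i^*:
E^*(\varepsilon_i^*) = 0, \qquad E^*(\varepsilon_i^{*2}) = \hat e_i^2, \qquad
E^*(\varepsilon_i^{*3}) = \hat e_i^3 .
```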


Connections with the F test and with the log-likelihood ratio test.

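In the fixed design case, under the assumption of normal errors, a typical test statistic for a particular null parametric hypothesis, $H_0$, versus some general parametric alternative, $H_1$, is

$$l_n(\hat\theta) - l_n(\tilde\theta) = c_n\,\frac{S(\tilde\theta) - S(\hat\theta)}{S(\hat\theta)} \overset{d}{=} F_{m_1, m_2}$$

with

$$S(\tilde\theta) = \sum_{i=1}^n \big(Y_i - m_{\tilde\theta}(x_i)\big)^2, \qquad S(\hat\theta) = \sum_{i=1}^n \big(Y_i - m_{\hat\theta}(x_i)\big)^2,$$

where $\tilde\theta$ and $\hat\theta$ are estimators of the true parameter under $H_0$ and $H_1$, respectively, and $l_n$ is the conditional log-likelihood function.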


Example

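In the well-known linear model, the hypothesis $H_0: B\theta = c$ (where $B$ is a $(q - q_1)\times q$ matrix) can be tested using the classical $F$-test

$$F = c_n\,\frac{S(\tilde\theta) - S(\hat\theta)}{S(\hat\theta)} \overset{d}{=} F_{m_1, m_2},$$

where $S(\tilde\theta)$ and $S(\hat\theta)$ are the residual sums of squares under $H_0$ and $H_1$, respectively, $c_n = \frac{n-q}{q-q_1}$, $m_1 = q - q_1$ and $m_2 = n - q$.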
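A minimal sketch of this $F$ statistic follows. The restricted estimator is obtained with the standard Lagrange-multiplier correction of the unrestricted least squares fit; the function name `f_statistic` and its argument names are hypothetical, and the design matrix is assumed to have full column rank.

```python
import numpy as np

def f_statistic(design, y, B, c):
    """F = c_n * (S_H0 - S_H1) / S_H1 for H0: B theta = c in a linear model.

    Returns the statistic and its degrees of freedom (m1, m2).
    """
    n, q = design.shape
    r = B.shape[0]                                    # q - q1 restrictions
    xtx_inv = np.linalg.inv(design.T @ design)
    theta_h1 = xtx_inv @ design.T @ y                 # unrestricted least squares
    gap = B @ theta_h1 - c
    # Restricted least squares: project the unrestricted fit onto {B theta = c}.
    theta_h0 = theta_h1 - xtx_inv @ B.T @ np.linalg.solve(B @ xtx_inv @ B.T, gap)
    s_h1 = np.sum((y - design @ theta_h1) ** 2)
    s_h0 = np.sum((y - design @ theta_h0) ** 2)
    c_n = (n - q) / r
    return c_n * (s_h0 - s_h1) / s_h1, (r, n - q)
```

For instance, with an intercept-plus-slope design, `B = np.array([[0.0, 1.0]])` and `c = np.array([0.0])` test whether the slope is zero, in which case the restricted fit reduces to the sample mean.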


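To use the test $T_n$ in practice we need to estimate $\sigma^2$ by means of some estimator, $\hat\sigma_1^2$, that is $n^{1/2}$-consistent under the alternative hypothesis $H_1$ (e.g. see Gasser, Sroka and Jennen-Steinmetz (1986)). Replacing $\sigma^2$ by $\hat\sigma_1^2$ in $T_n$ we have

$$T_n = \Big(2\operatorname{trace}\big((H^tH)^2\big)\Big)^{-1/2}\Big(\hat\sigma_1^{-2}\,\mathrm{RSS}(HY, HPY) - \operatorname{trace}(H^tH)\Big) = \frac{\operatorname{trace}(H^tH)}{\Big(2\operatorname{trace}\big((H^tH)^2\big)\Big)^{1/2}}\cdot\frac{\hat\sigma_0^2 - \hat\sigma_1^2}{\hat\sigma_1^2}.$$

See the paper by Fan et al. (2001) for a generalized likelihood ratio test statistic.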


Discussion about power

Assuming that our regression model is

$$m(x) = m_n(x) = m_\theta(x) + c_n\,g(x)$$

(Pitman alternatives), with $c_n \to 0$, it is well known that most of the tests that compare a parametric estimator of $m$ with a nonparametric one have a nontrivial power function (that is, the power exceeds the probability of rejecting $H_0$ when it is true) only for sequences of local alternatives with $c_n \to 0$ more slowly than $n^{-1/2}$.


Summary of the asymptotic sensitivity ($c_n$) to deviations from the hypothesis for Pitman alternatives:

Test                          H0       H1            c_n
$F$-test                      Param.   Param.        $n^{-1/2}$
$\chi^2(n-q)$                 Param.   Non param.    $n^{-1/4}$
$\mathrm{RSS}(HY, HPY)$       Param.   Non param.    $n^{-1/2}h^{-1/4}$

In the general case of a $p$-dimensional model, the rate would be $c_n = n^{-1/2}h^{-p/4}$.


Another way of analyzing the properties of the asymptotic power is the minimax approach of Ingster (Ingster (1982), (1993a), (1993b) and (1993c)).

In this approach, $m$ is assumed to belong to a class of differentiable functions in $\mathbb{R}^p$ (Hölder, Sobolev, Besov, ...). This class is generally denoted by $\mathcal{B}$.

$\mathcal{B}$ is moved apart from the set characterizing the null hypothesis, $\mathcal{M}$, by a distance $c_n \to 0$.


The goal of this minimax approach is to find the fastest rate at which $c_n$ can converge to zero while the test remains uniformly consistent on $\mathcal{B}$. This rate is called the optimal rate of testing.

A test is said to be uniformly consistent on $\mathcal{B}$ if:

$$\lim_{n\to\infty}\ \inf_{m\in\mathcal{B}}\ P(\text{Reject } H_0\,|\,m) = 1.$$


A summary of the rates for this case is the following:

Hölder, Sobolev or Besov classes (with bounded derivatives of order $s \ge p/4$, with $s$ known):

$$n^{-\frac{2s}{4s+p}}.$$

See, for further details, Ingster (1982), Ingster (1993a, 1993b, 1993c) and Guerre and Lavergne (2002).

For unknown $s$, Spokoiny (1996):

$$\left(n^{-1}\sqrt{\log\log n}\right)^{\frac{2s}{4s+p}}.$$

If $s < p/4$, Guerre and Lavergne (2002):

$$n^{-1/4}.$$


In Horowitz and Spokoiny (2001), the authors propose a test which is rate optimal in the minimax sense and detects Pitman alternatives with $c_n \sim n^{-1/2}\sqrt{\log\log n}$.

Essentially, this test considers a new test statistic:

$$\tilde T_n = \max_{h\in H_n} \frac{S_h(\hat\theta) - \hat N_h}{\hat V_h} = \max_{h\in H_n} \hat T_h,$$

where $S_h(\theta) = \|H(Y - m_\theta(X))\|^2$, $N_h$ is the mean of $S_h$ under $H_0$, $V_h^2$ is the variance of $S_h$ under $H_0$, and $S_h(\hat\theta)$, $\hat N_h$ and $\hat V_h$ are the corresponding plug-in versions.


$H_n$ is a set of $J_n$ smoothing values, for instance:

$$H_n = \{h = h_{\max}a^k,\ h \ge h_{\min},\ k = 0, 1, 2, \dots\},$$

with $0 < h_{\min} < h_{\max}$ and $0 < a < 1$.

In this case, $J_n = \log_{1/a}\left(\frac{h_{\max}}{h_{\min}}\right)$, and the test rejects $H_0$ if any of the $\hat T_h$ is large.
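The geometric grid is easy to generate, and doing so makes the size of $H_n$ concrete. In the sketch below the function name `bandwidth_grid` is hypothetical; note that the exact number of grid points is $\lfloor \log_{1/a}(h_{\max}/h_{\min}) \rfloor + 1$, which agrees with $J_n$ up to rounding.

```python
import math

def bandwidth_grid(h_max, h_min, a):
    """Return [h_max * a**k for k = 0, 1, 2, ... while h_max * a**k >= h_min]."""
    if not (0 < h_min < h_max and 0 < a < 1):
        raise ValueError("need 0 < h_min < h_max and 0 < a < 1")
    grid = []
    k = 0
    while h_max * a**k >= h_min:
        grid.append(h_max * a**k)
        k += 1
    return grid

grid = bandwidth_grid(h_max=1.0, h_min=0.1, a=0.5)
j_n = math.log(1.0 / 0.1) / math.log(1.0 / 0.5)      # log_{1/a}(h_max / h_min)
print(grid, len(grid), math.floor(j_n) + 1)
```

Since $J_n$ grows only logarithmically in $h_{\max}/h_{\min}$, maximizing over the grid adds little computational cost compared with a single-bandwidth test.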


For the linear model, under $H_0$:

$$\tilde T_n = \max_{h\in H_n} T_n(h),$$

with $T_n$ described above.


Tests based on the estimation of the integrated regression function


The integrated regression function.

Consider $(X, Y)$, where $F$ is the distribution function of $X$ and $m(x) = E(Y|X = x)$. The integrated regression function is

$$I(x) = E\big(Y 1_{\{X\le x\}}\big) = E\big[E\big(Y 1_{\{X\le x\}}\,\big|\,X\big)\big] = E\big[E(Y|X)\,1_{\{X\le x\}}\big] = E\big[m(X)\,1_{\{X\le x\}}\big]$$

$$= \int_{-\infty}^{\infty} m(y)\,1_{\{y\le x\}}\,dF(y) = \int_{-\infty}^{x} m(y)\,dF(y).$$

This function can be estimated empirically without smoothing by means of

$$I_n(x) = \frac{1}{n}\sum_{i=1}^n 1_{\{X_i\le x\}}\,Y_i.$$
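Because $I_n$ is a step function that jumps at the observed $X_i$, it can be evaluated at many points with one sort and one cumulative sum. The following is a small sketch; the function name `integrated_regression` is hypothetical.

```python
import numpy as np

def integrated_regression(x, y, x_eval):
    """I_n(x) = (1/n) * sum_i 1{X_i <= x} * Y_i, evaluated at each point of x_eval."""
    order = np.argsort(x, kind="stable")
    x_sorted = x[order]
    cum = np.concatenate([[0.0], np.cumsum(y[order])])        # cum[k] = sum of the k smallest-X marks
    idx = np.searchsorted(x_sorted, x_eval, side="right")     # number of X_i <= x
    return cum[idx] / len(x)
```

As a quick plausibility check, if $X$ is uniform on $[0, 1]$ and $m(x) = 2x$, then $I(x) = x^2$ on $[0, 1]$, and `integrated_regression` applied to simulated data should be close to that curve for large $n$.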


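This estimator is unbiased and

$$\mathrm{Var}(I_n(x)) = \frac{1}{n}\mathrm{Var}\big(1_{\{X_1\le x\}}Y_1\big) = \frac{1}{n}\mathrm{Var}\big(1_{\{X_1\le x\}}m(X_1)\big) + \frac{1}{n}E\big(1_{\{X_1\le x\}}\sigma^2(X_1)\big)$$

$$= \frac{1}{n}\int_{-\infty}^{x}\big(m^2(y) + \sigma^2(y)\big)\,dF(y) - \frac{1}{n}I(x)^2.$$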


The marked empirical process.

Main idea: compare a nonparametric estimator of the integrated regression with some estimator based on the assumptions of the null hypothesis.


Simple null hypothesis case

Let us first consider a simple null hypothesis $H_0: m = m_0$.

Nonparametric estimator:

$$I_n(x) = \frac{1}{n}\sum_{i=1}^n 1_{\{X_i\le x\}}\,Y_i.$$

Nonparametric estimator under $H_0$:

$$I_0(x) = \int_{-\infty}^{x} m_0(y)\,dF_n(y) = \frac{1}{n}\sum_{i=1}^n 1_{\{X_i\le x\}}\,m_0(X_i).$$

Thus,

$$I_n(x) - I_0(x) = \frac{1}{n}\sum_{i=1}^n 1_{\{X_i\le x\}}\big(Y_i - m_0(X_i)\big).$$


The empirical process marked by the regression errors:

$$R_n(x) = n^{1/2}\big(I_n(x) - I_0(x)\big) = n^{-1/2}\sum_{i=1}^n 1_{\{X_i\le x\}}\big(Y_i - m_0(X_i)\big)$$

has been studied by Stute (1997), who proved (under $E(Y^2) < \infty$) that

$$R_n \overset{d}{\longrightarrow} R_\infty$$

in the Skorohod space $D[-\infty, \infty]$, where $R_\infty$ is a Brownian motion with respect to time

$$T(x) = \int_{-\infty}^{x}\mathrm{Var}(Y|X = u)\,dF(u) = \int_{-\infty}^{x}\sigma^2(u)\,dF(u).$$


To test $H_0$ we only need to choose some functional (for instance the supremum, which leads to the Kolmogorov–Smirnov statistic). The critical value can be obtained from the distribution of such a functional computed from $R_\infty$.
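For the supremum functional, the statistic is the largest absolute partial sum of the residuals taken in the order of the covariate. The sketch below assumes a scalar covariate without ties; the function name `marked_process_sup` is hypothetical.

```python
import numpy as np

def marked_process_sup(x, y, m0):
    """sup_x |R_n(x)| with R_n(x) = n^{-1/2} * sum_i 1{X_i <= x} * (Y_i - m0(X_i))."""
    order = np.argsort(x, kind="stable")
    resid = (y - m0(x))[order]                     # marks, ordered by the covariate
    r_n = np.cumsum(resid) / np.sqrt(len(x))       # R_n at X_(1) <= ... <= X_(n)
    return float(np.max(np.abs(r_n)))
```

Since $R_\infty$ is a time-changed Brownian motion, dividing the statistic by an estimate of $\sqrt{T(\infty)}$, for example the root mean square of the residuals, gives a quantity whose limit is the supremum of $|B(t)|$ over $[0, 1]$ for a standard Brownian motion $B$; the tail approximation $P(\sup|B| > c) \approx 4(1 - \Phi(c))$ puts the corresponding 5% critical value at about 2.24.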


Composite null hypothesis case

The null hypothesis under study now is $H_0: m \in \{m_\theta : \theta \in \Theta\}$.

Consider $\hat\theta$, a suitable estimator of the true $\theta$ (say $\theta_0$).

Now the goodness-of-fit test statistics will be based on the process

$$R_n^1(x) = n^{-1/2}\sum_{i=1}^n 1_{\{X_i\le x\}}\big(Y_i - m_{\hat\theta}(X_i)\big).$$

Under fairly general assumptions, Stute (1997) proved that $R_n^1$ converges in distribution to a centered Gaussian limit, $R_\infty^1$, with a very complicated covariance structure.


As a consequence, the principal components of $R_\infty^1$ are difficult to obtain. This poses a real problem for full model checks, since optimal Neyman–Pearson tests for $H_0$ versus a given directional local alternative depend on these principal components.

Possible solution: use the Bootstrap!


Bootstrap approximations.

Main idea: find a bootstrap approximation of $R_n^1$.

$\{(X_i^*, Y_i^*)\}_{i=1}^n$: a bootstrap resample (to be defined later).

$\hat\theta^*$: the least squares estimator computed with this sample.

The bootstrap version of $R_n^1$ is

$$R_n^{1*}(x) = n^{-1/2}\sum_{i=1}^n 1_{\{X_i^*\le x\}}\big(Y_i^* - m_{\hat\theta^*}(X_i^*)\big).$$

$\Psi$: a continuous functional to define the test statistic:

$$T_n = \Psi\big(R_n^1\big).$$

Reject $H_0$ if $T_n > c_\alpha^*$, for $c_\alpha^*$ satisfying

$$P^*\big(\Psi(R_n^{1*}) > c_\alpha^*\big) = \alpha.$$


Intuitively, $c_\alpha^*$ is a reasonable estimator of the $c_\alpha$ satisfying

$$P\big(\Psi(R_n^1) > c_\alpha\big) = \alpha.$$

In practice we use

$$c_\alpha^* = T_n^{*(\lceil B(1-\alpha)\rceil)},$$

the $\lceil B(1-\alpha)\rceil$-th order statistic of the bootstrap replications $T_n^{*j}$, $j = 1, 2, \dots, B$.


The naive Bootstrap

Efron (1979)

Draw $\{(X_i^*, Y_i^*)\}_{i=1}^n$ from the empirical cdf of the original sample.

It is not consistent!


The smooth Bootstrap

Cao and González Manteiga (1993)

The bootstrap resamples are drawn from the following bivariate distribution function:

$$\hat F_n(x, y) = n^{-1}\sum_{i=1}^n 1_{\{Y_i\le y\}}\int_{-\infty}^{x} K_h(t - X_i)\,dt.$$

It is not consistent!


The naive resampling of the residuals

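The bootstrap resamples $\{(X_i^*, Y_i^*)\}_{i=1}^n$ are obtained as:

Step 1. Construct the parametric residuals:

$$\hat\varepsilon_i = Y_i - m_{\hat\theta}(X_i), \quad i = 1, 2, \dots, n.$$

Step 2. Recenter the previous residuals:

$$\tilde\varepsilon_i = \hat\varepsilon_i - \bar\varepsilon, \quad i = 1, 2, \dots, n, \quad\text{where } \bar\varepsilon = \frac{1}{n}\sum_{i=1}^n \hat\varepsilon_i.$$

Step 3. Draw bootstrap versions of the residuals, $\varepsilon_i^*$, from the empirical cdf of the $\{\tilde\varepsilon_i\}_{i=1}^n$.

Step 4. Compute $Y_i^* = m_{\hat\theta}(X_i) + \varepsilon_i^*$, $i = 1, 2, \dots, n$ (no resampling of the $X$'s).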

It is not consistent!

The wild Bootstrap

Wu (1986), Liu (1988), Härdle and Mammen (1993).

Step 1. Construct the parametric residuals:

$$\hat\varepsilon_i = Y_i - m_{\hat\theta}(X_i), \quad i = 1, 2, \dots, n.$$

Step 2. Draw independent r.v. $V_1^*, V_2^*, \dots, V_n^*$ (also independent of the observed sample) satisfying

$$E^*(V_i^*) = 0, \quad E^*\big(V_i^{*2}\big) = 1, \quad E^*\big(V_i^{*3}\big) = 1,$$

and construct the $\varepsilon_i^* = \hat\varepsilon_i V_i^*$.

Step 3. Compute $Y_i^* = m_{\hat\theta}(X_i) + \varepsilon_i^*$, $i = 1, 2, \dots, n$ (no resampling of the $X$'s).

This Bootstrap is consistent (see Stute, González-Manteiga and Presedo-Quindimil (1998)).
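The three steps, combined with the bootstrap critical value $c_\alpha^*$, can be sketched as follows. The sketch makes several assumptions that go beyond the text: a null model that is linear in its parameters (so refitting is a projection), the supremum as the functional $\Psi$, the two-point "golden section" law for $V_i^*$ (which has mean 0 and second and third moments equal to 1), and the hypothetical names `wild_bootstrap_test`, `sup_marked_process` and `design`.

```python
import numpy as np

SQRT5 = np.sqrt(5.0)

def sup_marked_process(x, resid):
    """Psi(R) = sup_x |n^{-1/2} * sum_i 1{X_i <= x} * resid_i| (no ties in x assumed)."""
    order = np.argsort(x, kind="stable")
    return float(np.max(np.abs(np.cumsum(resid[order])))) / np.sqrt(len(x))

def wild_bootstrap_test(x, y, design, alpha=0.05, B=500, seed=0):
    """Wild bootstrap calibration of sup|R_n^1| for a linear-in-parameters null.

    design(x) returns the regressor matrix of the null model (hypothetical hook).
    Returns (T_n, c_star, reject).
    """
    rng = np.random.default_rng(seed)
    a = design(x)
    hat = a @ np.linalg.pinv(a)                   # least squares projection matrix
    fitted = hat @ y
    resid = y - fitted                            # Step 1: parametric residuals
    t_n = sup_marked_process(x, resid)
    values = np.array([(1 - SQRT5) / 2, (1 + SQRT5) / 2])
    probs = np.array([(5 + SQRT5) / 10, (5 - SQRT5) / 10])
    t_star = np.empty(B)
    for b in range(B):
        v = rng.choice(values, size=len(y), p=probs)      # Step 2: multipliers V_i*
        y_star = fitted + resid * v                       # Step 3: X's are kept fixed
        t_star[b] = sup_marked_process(x, y_star - hat @ y_star)   # refit, then R_n^{1*}
    c_star = np.sort(t_star)[int(np.ceil(B * (1 - alpha))) - 1]
    return t_n, c_star, bool(t_n > c_star)
```

For example, `design = lambda x: np.column_stack([np.ones_like(x), x])` checks a straight-line model. Keeping each residual attached to its own $X_i$ is what lets the scheme reproduce heteroscedasticity, in contrast with the resampling plans above.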


Under $H_0$,

$$R_n^{1*} \overset{d}{\longrightarrow} R_\infty^{1*},$$

with probability one in the space $D[-\infty, \infty]$, where $R_\infty^{1*}$ and $R_\infty^1$ have the same distribution. See Stute et al. (2006) for a generalization with dependent data.

Another possibility is the calibration with martingale transformations. See, for instance, the paper by Stute et al. (1998) or the more recent work of Khmaladze and Koul (2004).


Related setups, extensions and open problems


Testing the equality of two regression curves

One of the most important problems in statistical inference is the comparison of two or more groups of variables. This comparison can be performed by comparing means, medians or any other characteristic of the variable of interest, namely $Y$, measured in the sample for each group.

When this variable is accompanied by a regression covariate $X$, a more ambitious objective is to compare the regression functions

$$m_l(x) = E(Y_l|X = x)$$

associated with each group.


If the regression functions in the different groups are specified parametrically,

$$m_l = m_{\theta_l}, \quad \theta_l \in \Theta \subset \mathbb{R}^q,$$

we have the classical analysis of covariance. The case where the regression in the groups is nonparametric,

$$\{m_l\}_{l=1}^{k}, \quad m_l \in \mathcal{M},$$

with $\mathcal{M}$ a function space satisfying some regularity conditions, is quite recent (last fifteen years).


In general, such a model can be written as

$$Y_{ij} = m_i(t_{ij}) + \sigma_i(t_{ij})\varepsilon_{ij}, \quad i = 1,\dots,k, \quad j = 1,\dots,n_i,$$

where the $\varepsilon_{ij}$ are i.i.d. with zero mean, $m_i$ and $\sigma_i^2$ are the regression and variance functions for the $i$th group, and $t_{ij}$ varies in $[0,1]$, without loss of generality. The problem is to test

$$H_0: m_1 = \dots = m_k \quad \text{vs.} \quad H_a: \exists\,(i,j) \text{ such that } m_i \neq m_j.$$


This problem can be approached from different perspectives:

Using smoothing techniques (e.g. Hardle and Marron (1990), Hall and Hart (1990), King et al. (1991), Young and Bowman (1995), Kulasekera (1995), Kulasekera and Wang (1997, 1998), Hall et al. (1997), Lavergne (2001), Dette and Neumeyer (2001) and Vilar-Fernandez and Gonzalez-Manteiga (2004), among others).


For solving this problem, it is crucial to estimate

$$D = \sum_{i<j} \int_0^1 \big(m_i(t) - m_j(t)\big)^2\,dt,$$

considering the test statistic

$$Q_n = \sum_{i<j} \int_0^1 \big(\hat m_i(t) - \hat m_j(t)\big)^2 \omega_{ij}(t)\,dt,$$

where $\{\omega_{ij}\}_{i<j}$ are weight functions, and

$$\hat m_l(t) = \sum_{i=1}^{n} W_{li}(t)\,Y_{li}, \quad l = 1,\dots,k,$$

are the nonparametric kernel estimators of the regression functions.
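As a rough numerical illustration of the statistic $Q_n$, the sketch below evaluates it on a grid over $[0,1]$. Several choices are assumptions made only for this example and are not taken from the slides: Nadaraya-Watson weights with a Gaussian kernel, unit weight functions $\omega_{ij} \equiv 1$, a common bandwidth `h`, and the trapezoidal rule for the integral. The function names `nw_estimate` and `q_n` are hypothetical.

```python
import numpy as np

def nw_estimate(t_grid, t_obs, y_obs, h):
    """Nadaraya-Watson estimate of a regression function on t_grid."""
    u = (t_grid[:, None] - t_obs[None, :]) / h
    k = np.exp(-0.5 * u ** 2)                 # Gaussian kernel (assumed choice)
    w = k / k.sum(axis=1, keepdims=True)      # weights W_li(t); each row sums to 1
    return w @ y_obs

def q_n(samples, h, n_grid=201):
    """Q_n = sum_{i<j} int_0^1 (m_i_hat - m_j_hat)^2 dt, with omega_ij = 1 (assumed)."""
    grid = np.linspace(0.0, 1.0, n_grid)
    fits = [nw_estimate(grid, t, y, h) for t, y in samples]
    dt = grid[1] - grid[0]
    total = 0.0
    for i in range(len(fits)):
        for j in range(i + 1, len(fits)):
            sq = (fits[i] - fits[j]) ** 2
            # trapezoidal rule on the equally spaced grid
            total += dt * (sq.sum() - 0.5 * (sq[0] + sq[-1]))
    return total
```

With `samples = [(t1, y1), (t2, y2)]`, a call such as `q_n(samples, h=0.1)` returns the value of the statistic; its null distribution still has to be calibrated, for instance by Bootstrap.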


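A general result, allowing for a correlation structure in the errors (for $k = 2$), is given in Vilar-Fernandez and Gonzalez-Manteiga (2004):

$$\sqrt{n^2 h}\left(Q_n - \frac{1}{nh}\, I_\omega \Gamma_\Delta C_K\right) \to N(0, \sigma_Q^2),$$

where

$$\Gamma_\Delta = \sum_{k=-\infty}^{\infty} \gamma_\Delta(k), \quad C_K = \int K^2, \quad I_\omega = \int \omega, \quad \sigma_Q^2 = 2\Gamma_\Delta^2 \int (K * K)^2 \int \omega^2,$$

and $\gamma_\Delta$ denotes the covariance function of the differences $\varepsilon_{1t} - \varepsilon_{2t}$.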


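Similar results, for independent errors, were given by Dette and Neumeyer (2001) for the statistic

$$Q_n^{1} = \hat\sigma^2 - \frac{1}{N}\sum_{i=1}^{k} n_i \hat\sigma_i^2,$$

where

$$\hat\sigma_i^2 = \frac{1}{n_i}\sum_{j=1}^{n_i} \big(Y_{ij} - \hat m_i(t_{ij})\big)^2, \quad i = 1,\dots,k,$$

$$\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} \big(Y_{ij} - \hat m(t_{ij})\big)^2,$$

with $N = \sum_{i=1}^{k} n_i$ and $\hat m$ the nonparametric estimator for the whole sample. The authors also give some results for

$$Q_n^{2} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} \big(\hat m(t_{ij}) - \hat m_i(t_{ij})\big)^2.$$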
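The two statistics of Dette and Neumeyer (2001) can be computed directly from pooled and per-group fits. The following is only a sketch under stated assumptions: the smoother is a Nadaraya-Watson estimator with a Gaussian kernel and a single bandwidth `h` for every fit, which need not match the estimators used in the paper, and the names `nw_fit` and `dette_neumeyer_stats` are hypothetical.

```python
import numpy as np

def nw_fit(t_eval, t_obs, y_obs, h):
    """Nadaraya-Watson fit at t_eval (Gaussian kernel, an assumed choice)."""
    u = (t_eval[:, None] - t_obs[None, :]) / h
    k = np.exp(-0.5 * u ** 2)
    return (k @ y_obs) / k.sum(axis=1)

def dette_neumeyer_stats(groups, h):
    """groups: list of (t_i, y_i) arrays, one pair per group. Returns (Q1, Q2)."""
    t_all = np.concatenate([t for t, _ in groups])
    y_all = np.concatenate([y for _, y in groups])
    n_total = t_all.size
    pooled_fit = nw_fit(t_all, t_all, y_all, h)        # m_hat at every design point
    sigma2_pooled = np.mean((y_all - pooled_fit) ** 2)
    weighted_within = 0.0
    q2_sum = 0.0
    start = 0
    for t, y in groups:
        n_i = t.size
        group_fit = nw_fit(t, t, y, h)                 # m_hat_i at the group's points
        weighted_within += n_i * np.mean((y - group_fit) ** 2)
        q2_sum += np.sum((pooled_fit[start:start + n_i] - group_fit) ** 2)
        start += n_i
    q1 = sigma2_pooled - weighted_within / n_total
    return q1, q2_sum / n_total
```

Both values are close to zero when the groups share a regression function and grow as the curves separate.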


In Munk and Dette (1998), the authors suggest estimating $D$ directly, but this procedure can only detect alternatives at the rate $n^{-1/4}$, compared with the rate $(n^2 h)^{-1/4}$ of the other tests.


Using empirical regression processes.

For instance, suppose that we have $(X, Y^1)$ and $(X, Y^2)$ and our aim is to test whether

$$H_0: E(Y^1 \mid X = x) = E(Y^2 \mid X = x), \quad \forall x \in I \subset \mathbb{R}.$$

This is equivalent to testing

$$H_0: E(Y^1 - Y^2 \mid X = x) = 0$$

(see the papers by Delgado (1993) or, more recently, Ferreira and Stute (2004) for dependent data). In this context, we could consider the empirical process

$$R_n(x) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} 1_{\{X_i \le x\}}\,(Y_i^1 - Y_i^2).$$
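Because $R_n$ is a step function that jumps only at the observed covariate values, it can be evaluated by a cumulative sum over the sorted sample. The sketch below assumes a scalar covariate without ties and uses a Kolmogorov-Smirnov type functional $\sup_x |R_n(x)|$ as an illustrative test statistic; the function names are hypothetical.

```python
import numpy as np

def paired_marked_process(x, y1, y2):
    """Return the sorted X_i and R_n evaluated at each of them (no ties assumed)."""
    order = np.argsort(x)
    diffs = (y1 - y2)[order]
    return x[order], np.cumsum(diffs) / np.sqrt(x.size)

def ks_statistic(x, y1, y2):
    """Kolmogorov-Smirnov type functional sup_x |R_n(x)|."""
    _, r = paired_marked_process(x, y1, y2)
    return np.max(np.abs(r))
```

Critical values are not provided by this sketch; they would come from the limit process or from a resampling scheme.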


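Using a mixture of both ideas.

For instance, in the paper by Neumeyer and Dette (2003), the authors use a marked empirical process of the type

$$\hat R_N(t) = \frac{1}{N}\sum_{j=1}^{n_1} \hat f_{1j}\,1_{\{X_{1j} \le t\}} - \frac{1}{N}\sum_{j=1}^{n_2} \hat f_{2j}\,1_{\{X_{2j} \le t\}},$$

with

$$\hat f_{ij} = \frac{N}{n_i}\,\frac{Y_{ij} - \hat m(X_{ij})}{\hat\tau_i(X_{ij})} \quad \text{and} \quad \hat\tau_i(x) = \frac{1}{n_i h}\sum_{j=1}^{n_i} K\!\left(\frac{x - X_{ij}}{h}\right)$$

the kernel density estimator of $X$ in the $i$th population.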


In all these cases, the calibration of the distribution of the test statistics (for example, using Bootstrap) is very important. See Vilar-Fernandez et al. (2007) for an exhaustive comparison assuming correlated errors.


Testing partial linearity.

Testing within the partially linear model

Let us consider the partially linear regression model

$$Y_i = X_i^t \beta + m(t_i) + \varepsilon_i, \quad i = 1,\dots,n,$$

where the $(p \times 1)$ parameter vector $\beta$ and the function $m$ are unknown. $X_i$ and $t_i$ denote the design points (random and fixed, respectively). Regarding this model, different testing problems can be considered, for instance:

$H_{0\beta}: \beta = \beta_0$,

$H_{0m}: m = m_0$,

$H_{0m}^{l}: m \in \mathrm{span}\{f_1,\dots,f_l\}$, where the $f_j$, $j = 1,\dots,l$, are linearly independent functions.


[Figure: global air temperature anomaly (°C), 1860 to 2000. Annotation: 2007 anomaly +0.40 °C (8th warmest on record).]

See Gao (2007).


The problem of testing on the nonparametric component of the model can be approached by smoothing techniques. For instance, we may use the distance

$$d(\hat m, H_{0m}^{l}) = \frac{1}{n}\sum_{i=1}^{n} \big(\hat m_h(t_i, \hat\beta) - F^t(t_i)\hat\theta\big)^2,$$

where $F(t) = (f_1(t),\dots,f_l(t))^t$, $\theta = (\theta_1,\dots,\theta_l)^t$ and $\{t_i\}_{i=1}^{n}$ are the design points.


In this case, the estimator of the parameter vector is

$$\hat\beta = (X_b^t X_b)^{-1} X_b^t Y_b, \quad X_b = (I - W_b)X, \quad Y_b = (I - W_b)Y,$$

with $W_b$ the smoothing kernel matrix with bandwidth $b$ and $X = (X_1,\dots,X_n)^t$, and the estimator of the nonparametric component is

$$\hat m_h(t_i, \hat\beta) = \sum_{j=1}^{n} W_{nj}(t_i)\,(Y_j - X_j^t \hat\beta).$$

See, for instance, the papers by Robinson (1988) and Speckman (1988). In this case, $\hat\theta$ is taken as a minimum distance estimator.
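The two-step estimator above (partial out the smooth part, then run least squares, then smooth the partial residuals) can be sketched in a few lines. The sketch assumes Nadaraya-Watson smoother matrices with a Gaussian kernel for both bandwidths `b` and `h`; the names `kernel_smoother_matrix` and `speckman_fit` are hypothetical and the code is not taken from the cited papers.

```python
import numpy as np

def kernel_smoother_matrix(t, bandwidth):
    """Row-normalized Gaussian kernel weights (an assumed smoother)."""
    u = (t[:, None] - t[None, :]) / bandwidth
    k = np.exp(-0.5 * u ** 2)
    return k / k.sum(axis=1, keepdims=True)

def speckman_fit(X, t, Y, b, h):
    """Return (beta_hat, m_hat at the design points) for Y = X beta + m(t) + eps."""
    W_b = kernel_smoother_matrix(t, b)
    X_b = X - W_b @ X                      # (I - W_b) X
    Y_b = Y - W_b @ Y                      # (I - W_b) Y
    # least squares solution of X_b beta = Y_b, i.e. (X_b' X_b)^{-1} X_b' Y_b
    beta_hat, *_ = np.linalg.lstsq(X_b, Y_b, rcond=None)
    W_h = kernel_smoother_matrix(t, h)
    m_hat = W_h @ (Y - X @ beta_hat)       # smooth the partial residuals
    return beta_hat, m_hat
```

The distance $d(\hat m, H_{0m}^{l})$ then follows by regressing `m_hat` on the basis functions $f_1,\dots,f_l$ evaluated at the design points and averaging the squared differences.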


In Gonzalez-Manteiga and Aneiros-Perez (2003), the authors provide a general result:

$$\sqrt{n^2 h}\left(d(\hat m, H_{0m}^{l}) - \frac{1}{nh}\sum_{s=-\infty}^{\infty}\gamma(s)\int K^2(u)\,du\right) \to N\!\left(\int g^2(x)\,dx,\ \sigma_d^2\right),$$

considering local alternatives of the form

$$m(\cdot) = F^t(\cdot)\theta_0 + (n^2 h)^{-1/4} g,$$

with a correlation structure for the errors, where $\gamma$ denotes their covariance function.

Generalizations for long-memory processes are considered in Aneiros et al. (2004). Once again, the Bootstrap calibration is important for the practical application of the tests.


Testing the partially linear model

Suppose that $Y$ is the response variable in a partially linear model and $X^t = (X_1^t, X_2^t)$ is the vector of covariates. Assume that we are interested in testing the following null hypothesis:

$$H_0: E(Y \mid X) = X_1^t\theta_0 + m(X_2).$$

In Fan and Li (1996), the authors propose a consistent test based on smoothing methods. Following the ideas of Zheng's test (Zheng (1996)), they suggest estimating:

E(C2) = E(ε0E(ε0|X)f(X)π2(X)),

where $\varepsilon_0 = Y - X_1^t\theta_0 - m(X_2)$.


In particular, in Fan and Li (2003) the authors consider π2 = 1 and use a density weighted version of the estimator, in order to avoid the problem of a random denominator. Zhu and Ng (2003) proposed a test based on empirical regression processes. Define:

$$U(X_1, X_2) = X_1 - E(X_1 \mid X_2), \quad V(X_2, Y) = Y - E(Y \mid X_2),$$

$$S = E\big(U U^t \omega^2(X_2)\big),$$

$$\beta = S^{-1} E\big(U(X_1, X_2)\,V(X_2, Y)\,\omega^2(X_2)\big),$$

where $\omega$ is a weight function. In this context, $H_0$ is true if and only if

$$E\big[(Y - \beta^t U(X_1, X_2) - m(X_2))\,\omega(X_2)\,1_{\{X_2 \le u,\ X_1 \le x\}}\big] = 0, \quad \forall (u, x).$$


In this way, if $\{(X_{1i}, X_{2i}, Y_i)\}_{i=1}^{n}$ is the initial sample, the empirical regression process is given by

$$R_n(u, x) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \hat\varepsilon_{i0}\,\omega(X_{2i})\,1_{\{X_{2i} \le u,\ X_{1i} \le x\}},$$

and it can be used considering the residuals estimated under the null hypothesis:

$$\hat\varepsilon_{i0} = Y_i - \hat\beta^t \hat U(X_{1i}, X_{2i}) - \hat m(X_{2i}).$$

They also study the weak convergence of this empirical process and the tests associated with it (for instance, the Cramér-von Mises test). See the section on significance tests for related aspects.


Generalized linear regression models

Suppose first that $Y$ is a binary random variable. The function of interest is

$$p(x) = P(Y = 1 \mid X = x) = E(Y \mid X = x) = m(x).$$

The main problem is to test $H_0: p \in \{p_\theta : \theta \in \Theta\}$. As an example, consider the binary logistic regression model:

$$p_\theta(x) = \frac{\exp(\theta_0 + \theta_1 x)}{1 + \exp(\theta_0 + \theta_1 x)} = m_\theta(x).$$


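Let us consider the generalized linear regression hypothesis studied by Rodrıguez-Campos et al. (1998):

$$H_0: p \in \{p_\theta(x) = G(\theta_0 + \theta x) : \theta_0 \in \Theta_0,\ \theta \in \Theta\},$$

where $G$ is a known link function, $\Theta_0 \subset \mathbb{R}$ and $\Theta \subset \mathbb{R}^q$. The statistic

$$d(\hat m, m_{\hat\theta}) = \frac{1}{n}\sum_{j=1}^{n} \big(\hat m(X_j) - m_{\hat\theta}(X_j)\big)^2 w(X_j)$$

is used to test $H_0$.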


Similarly to the cases studied in Section 2, asymptotic normality is also obtained. A binary bootstrap method can be used to approximate the critical values of the test as follows:

1. Compute the estimator $\hat\theta$.

2. Let $Y_i^* = 1$ with probability $p_{\hat\theta}(X_i)$ and $Y_i^* = 0$ with probability $1 - p_{\hat\theta}(X_i)$.

This bootstrap method is consistent under $H_0$.
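The binary bootstrap can be sketched for the logistic example with a scalar covariate. Everything beyond the two steps listed above is an assumption made for illustration: the parameter is estimated by Newton-Raphson maximum likelihood (with a tiny ridge term for numerical stability), $\hat m$ is a Nadaraya-Watson estimator with a Gaussian kernel, the weight is $w \equiv 1$, and the statistic is recomputed on each resample to obtain a bootstrap p-value. All function names are hypothetical.

```python
import numpy as np

def logistic(theta, x):
    z = np.clip(theta[0] + theta[1] * x, -30.0, 30.0)   # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson maximum likelihood for (theta_0, theta_1)."""
    Z = np.column_stack([np.ones_like(x), x])
    theta = np.zeros(2)
    for _ in range(n_iter):
        p = logistic(theta, x)
        score = Z.T @ (y - p)
        info = (Z * (p * (1.0 - p))[:, None]).T @ Z
        theta = theta + np.linalg.solve(info + 1e-8 * np.eye(2), score)
    return theta

def nw_fit(x_eval, x_obs, y_obs, h):
    u = (x_eval[:, None] - x_obs[None, :]) / h
    k = np.exp(-0.5 * u ** 2)
    return (k @ y_obs) / k.sum(axis=1)

def gof_statistic(x, y, h):
    """d(m_hat, m_theta_hat) with w = 1; also returns theta_hat."""
    theta = fit_logistic(x, y)
    return np.mean((nw_fit(x, x, y, h) - logistic(theta, x)) ** 2), theta

def binary_bootstrap_pvalue(x, y, h, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    d_obs, theta = gof_statistic(x, y, h)            # step 1: compute theta_hat
    p_fit = logistic(theta, x)
    d_star = np.empty(n_boot)
    for b in range(n_boot):
        # step 2: Y*_i = 1 with probability p_theta_hat(X_i), else 0
        y_star = (rng.random(x.size) < p_fit).astype(float)
        d_star[b], _ = gof_statistic(x, y_star, h)
    return d_obs, np.mean(d_star >= d_obs)
```

The covariates are kept fixed across resamples, so only the responses are regenerated under the fitted null model.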


In the paper by Muller (2001), a more general model (generalized partial linear model, GPLM),

$$E(Y \mid X, T) = G(\gamma + X^t\theta + m(T)),$$

is tested. For instance, suppose that we are interested in testing

$$H_0: E(Y \mid X, T) = \mu(X, T) = G(X^t\theta + T^t\gamma + \gamma_0)$$

vs.

$$H_a: E(Y \mid X, T) = G(X^t\theta + m(T)).$$


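Denote the semiparametric and parametric fits by

$$\hat\mu_i = G(X_i^t\hat\theta + \hat m(T_i)),$$

$$\tilde\mu_i = G(X_i^t\tilde\theta + T_i^t\tilde\gamma + \tilde\gamma_0),$$

for $i = 1,\dots,n$. A natural approach is to compare both estimates by a likelihood ratio test statistic (recall the GLRT discussed before):

$$R = 2\sum_{i=1}^{n} \big(L(\hat\mu_i, Y_i) - L(\tilde\mu_i, Y_i)\big),$$

applied to the data $\{Y_i, X_i, T_i\}_{i=1}^{n}$.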


Following the idea of smoothing the hypothesis, as in previous sections, we could apply this method but replacing the original data by the artificial sample

$$\{G(X_i^t\tilde\theta + T_i^t\tilde\gamma + \tilde\gamma_0),\ X_i,\ T_i\}.$$

That is to say, replacing the second argument in $L(\cdot,\cdot)$ by a parametric estimate of $E(Y \mid X, T)$.

See Muller (2001) for more details.


Significance tests.

Tests based on smoothing and on empirical regression processes can be used to test the significance of a possible covariate in a regression model:

$$H_0: E(Y \mid X) = E(Y \mid X_1) = m_1(X_1),$$

where $X^t = (X_1^t, X_2^t)$, with $X_1$ of dimension $p_1$, $X_2$ of dimension $p_2$ and $X$ of dimension $p = p_1 + p_2$.

Fan and Li (1996) studied this problem using smoothing-based statistics. Under the null hypothesis,

$$Y_i = m_1(X_{1i}) + \varepsilon_{0i}, \quad i = 1,\dots,n,$$

where

$$E(\varepsilon_{0i} \mid X_i) = m(X_i) - m_1(X_{1i}) = 0.$$


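Observing that

$$E\big(\varepsilon_{0i}\,E(\varepsilon_{0i} \mid X_i)\big) = E\big(E(\varepsilon_{0i} \mid X_i)^2\big) \ge 0,$$

with equality if and only if $H_0$ is true, a consistent test can be built by estimating

$$I = E\big(\varepsilon_0\,E(\varepsilon_0 \mid X)\big).$$

To overcome the problem of a random denominator in kernel estimation, a density weighted version of

$$\frac{1}{n}\sum_{i=1}^{n} \varepsilon_{0i}\,E(\varepsilon_{0i} \mid X_i)$$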


is given by

$$\frac{1}{n}\sum_{i=1}^{n} \varepsilon_{0i}\,f_{X_1}(X_{1i})\,E\big(\varepsilon_{0i}\,f_{X_1}(X_{1i}) \mid X_i\big)\,f(X_i),$$

where $f_{X_1}$ is the density of $X_1$ and $f$ denotes the density of $X$.

The nonparametric estimate of the last expression is the statistic proposed by Fan and Li (1996).


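An alternative way is to estimate

$$I = E\big(E(\varepsilon_0 \mid X)^2\big).$$

An integrated version of the estimator of $I$ is

$$I_n = \int \left(\frac{1}{n h^p}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)\big(Y_i - \hat m_1(X_{1i})\big)\right)^2 p(x)\,dx,$$

where $p(\cdot)$ is a weight function.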


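A general result is given in Gonzalez-Manteiga et al. (2002):

$$\frac{n h^{p/2} I_n - h^{-p/2} B}{\sqrt{V}} \to N(0, 1),$$

where

$$B = \int \sigma^2(x) f(x) p(x)\,dx \int K^2(u)\,du$$

and

$$V = 2\int \sigma^4(x) f^2(x) p^2(x)\,dx \int (K * K)^2(u)\,du, \quad \sigma^2(x) = \mathrm{Var}(Y \mid X = x).$$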


The previous tests require smoothing both under the null and the alternative hypothesis. In Delgado and Gonzalez-Manteiga (2001), another methodology, requiring smoothing only under $H_0$, is proposed. Assuming that the distribution of $X_1$ admits a density $f_{X_1}$, note that testing

$$H_0: E(Y - m_1(X_1) \mid X) = 0$$

is equivalent to testing that

$$f_{X_1}(X_1)\,E(Y - m_1(X_1) \mid X) = 0.$$


If we define

$$T(x) = E\big(f_{X_1}(X_1)\,(Y - m_1(X_1))\,1_{\{X \le x\}}\big),$$

then the null hypothesis can be formulated as $H_0: T(x) = 0$ for all $x$. In this way, an empirical regression process is given by

$$T_n(x) = \frac{1}{n}\sum_{i=1}^{n} \hat f_{X_1}(X_{1i})\,\big(Y_i - \hat m_1(X_{1i})\big)\,1_{\{X_i \le x\}}.$$

The weak convergence of the process is obtained in Delgado and Gonzalez-Manteiga (2001), and functionals of the process $T_n$ are used as test statistics. For instance, a Cramér-von Mises statistic, with $F_n$ the empirical distribution function of the $X_i$:

$$c_n = \int (n^{1/2} T_n)^2\,dF_n = \sum_{i=1}^{n} T_n^2(X_i).$$

Different ways of calibrating the distribution of the test statistics are also discussed in Delgado and Gonzalez-Manteiga (2001).
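A small sketch shows why the density weighting removes the random denominator: with a kernel density estimator and a Nadaraya-Watson estimator built from the same kernel and bandwidth, the product $\hat f_{X_1}(X_{1i})(Y_i - \hat m_1(X_{1i}))$ equals $(nh)^{-1}\sum_j K((X_{1i} - X_{1j})/h)(Y_i - Y_j)$, so no division is needed. The code assumes scalar $X_1$ and $X_2$, a Gaussian kernel, a componentwise indicator on the full covariate vector as in the definition of $T(x)$, and hypothetical function names.

```python
import numpy as np

def density_weighted_marks(x1, y, h):
    """f_hat(X_1i) * (Y_i - m1_hat(X_1i)) computed without any division by f_hat."""
    n = y.size
    u = (x1[:, None] - x1[None, :]) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)    # Gaussian kernel (assumed)
    return (k * (y[:, None] - y[None, :])).sum(axis=1) / (n * h)

def cvm_significance(x1, x2, y, h):
    """Cramer-von Mises type statistic sum_l T_n(X_l)^2."""
    n = y.size
    marks = density_weighted_marks(x1, y, h)
    X = np.column_stack([x1, x2])
    # below[i, l] is True when X_i <= X_l componentwise
    below = np.all(X[:, None, :] <= X[None, :, :], axis=2)
    t_n = below.T.astype(float) @ marks / n             # T_n evaluated at each X_l
    return np.sum(t_n ** 2)
```

The statistic only requires smoothing in the $X_1$ direction, which is the point of the approach when $X_2$ is high dimensional.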


Testing in additive models.

Let us consider the multivariate regression model $Y = m(X) + \sigma(X)\varepsilon$ with a vector of covariates $X = (X_1,\dots,X_p)^t$. In order to avoid the curse of dimensionality, an additive form for the regression function,

$$m(X) = c + \sum_{i=1}^{p} m_i(X_i),$$

is assumed. This is a simpler and more interpretable model, which makes it possible to explore the influence of each covariate on the response variable. Within this framework, different methods for checking additive models have been developed.


Testing additivity.

An important problem in this context is to test

$$H_0: m(x) = c + \sum_{i=1}^{p} m_i(x_i)$$

vs.

$$H_a: m \text{ is smooth},$$

where $\{m_i\}_{i=1}^{p}$ is a collection of unknown functions. Some references on this topic are Eubank et al. (1995) or, more recently, Dette et al. (2005), Gozalo and Linton (2001), Dette et al. (2001) or Fan and Jiang (2005). Denoting by $\hat m$ the general nonparametric estimator and by $\hat m_0$ the one under the null hypothesis, we may consider different test statistics:


Generalizing the ideas in Gonzalez-Manteiga and Cao (1993) and Hardle and Mammen (1993):

$$T_{1n} = \frac{1}{n}\sum_{i=1}^{n} \big(\hat m(X_i) - \hat m_0(X_i)\big)^2.$$

Considering the correlation between the residuals and a suitable function of $X$, following Gozalo and Linton (2001), and taking $\hat e_i = Y_i - \hat m_0(X_i)$:

$$T_{2n} = \frac{1}{n}\sum_{i=1}^{n} \hat e_i\,\big(\hat m(X_i) - \hat m_0(X_i)\big).$$


Considering the difference between the variance estimators (see the works by Dette et al.), with $\hat u_i = Y_i - \hat m(X_i)$:

$$T_{3n} = \frac{1}{n}\sum_{i=1}^{n} (\hat e_i^2 - \hat u_i^2).$$

Or extending the test proposed by Zheng (1996):

$$T_{4n} = \frac{1}{n(n-1)}\sum_{i \neq j} L_g(X_i - X_j)\,\hat e_i \hat e_j, \quad L_g(\cdot) = \frac{1}{g^p} L\!\left(\frac{\cdot}{g}\right),$$

with $L$ a multidimensional kernel.

In recent works such as Dette et al. (2005), marginal integration is used to estimate $m$ under the null hypothesis of additive structure.
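Given residuals from an additive fit, the Zheng-type statistic $T_{4n}$ is a quadratic form in the residual vector. The sketch below assumes a product Gaussian kernel for $L$ and takes the residuals as an input, since the way the additive fit is obtained (backfitting, marginal integration) is outside its scope; the function name is hypothetical.

```python
import numpy as np

def zheng_type_statistic(X, residuals, g):
    """T_4n = (1 / (n (n - 1))) * sum_{i != j} L_g(X_i - X_j) e_i e_j."""
    n, p = X.shape
    diff = (X[:, None, :] - X[None, :, :]) / g
    # product Gaussian kernel L evaluated at (X_i - X_j) / g (assumed choice)
    kernel = np.exp(-0.5 * np.sum(diff ** 2, axis=2)) / (2.0 * np.pi) ** (p / 2.0)
    l_g = kernel / g ** p
    np.fill_diagonal(l_g, 0.0)             # the sum excludes the terms with i = j
    return (residuals @ l_g @ residuals) / (n * (n - 1))
```

Dropping the diagonal terms is what centres the statistic at zero under the null hypothesis, so no bias correction term is needed.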


Testing on the additive components.

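Let $p_0$ be an integer and consider the following testing problem:

$$H_0: m_{p-p_0}(X_{p-p_0}) = \dots = m_p(X_p) = 0$$

vs.

$$H_a: m_{p-p_0}(X_{p-p_0}) \neq 0, \text{ or } \dots, \text{ or } m_p(X_p) \neq 0.$$

For solving this problem, Fan and Jiang (2005) proposed a GLRT given by

$$\lambda_n = \frac{n}{2}\log\frac{RSS_0}{RSS_1} \approx \frac{n}{2}\,\frac{RSS_0 - RSS_1}{RSS_1},$$

where

$$RSS_0 = \sum_{i=1}^{n}\Big(Y_i - \tilde c - \sum_{k=1}^{p-p_0-1} \tilde m_k(X_{ki})\Big)^2, \quad RSS_1 = \sum_{i=1}^{n}\Big(Y_i - \hat c - \sum_{k=1}^{p} \hat m_k(X_{ki})\Big)^2,$$

with $\tilde m_l$ and $\hat m_l$ the nonparametric estimators of the $l$th component under the null and the alternative hypothesis, respectively.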


In Fan and Jiang (2005), the authors consider a Backfitting algorithm for the estimation of the additive components, and the p-value of the test statistic is approximated by Bootstrap. Other approaches consider testing for parametric structures in the components.


An example of a non-additive model:

[Figure: plot of NOx against C and E.]

Engine exhaust fumes from burning ethanol.

NOx: concentration of nitrogen oxides.

C: compression ratio of the engine.

E: equivalence ratio.

Null hypothesis:

$$H_0: \mathrm{NOx} = c + m_1(C) + m_2(E).$$

p-values: 0.074, 0.036, 0.026, 0.004.


Marginal estimations (backfitting).

[Figure: two panels of estimated marginal effects. First panel: compression ratio C, lo(X1) against X1. Second panel: equivalence ratio E, lo(X2) against X2.]


An example of an additive model:

[Figure: plot of medv against rm and lstat.]

Housing data for 506 census tracts of Boston from the 1970 census.

medv: median value of owner-occupied homes in USD 1000's.

rm: average number of rooms.

lstat: percentage of lower status of the population.

Null hypothesis:

$$H_0: \mathrm{medv} = c + m_1(\mathrm{rm}) + m_2(\mathrm{lstat}).$$

p-values: 0.968, 0.970, 0.970.


An example of an additive model:

[Figure: plot of Yn against AN and Ctime.]

Naphthalene dataset.

Yn: percentage mole conversion of naphthalene to naphthoquinone.

AN: air to naphthalene ratio ($X_1 = \log(\mathrm{AN})$).

Ctime: contact time ($X_2 = \log(\mathrm{Ctime})$).

Null hypothesis:

$$H_0: \mathrm{Yn} = c + m_1(X_1) + m_2(X_2).$$

p-values: 0.672, 0.506, 0.238, 0.380.


Testing for interactions.

Consider the following decomposition of the additive model:

$$m(X) = c + \sum_{d=1}^{p} m_d(X_d) + \sum_{1 \le i \le j \le p} m_{i,j}(X_i, X_j).$$

We could be interested in testing

$$H_0: m_{d_1,d_2}(X_{d_1}, X_{d_2}) = 0 \quad \text{vs.} \quad H_a: m_{d_1,d_2}(X_{d_1}, X_{d_2}) \neq 0.$$

In Sperlich et al. (2002), the authors introduced the following test statistic:

$$T = \int \hat m_{d_1,d_2}^2(X_{d_1}, X_{d_2})\,\pi(X_{d_1}, X_{d_2})\,dX_{d_1}\,dX_{d_2},$$

where the nonparametric estimator $\hat m_{d_1,d_2}$ is obtained by marginal integration. The calibration of the distribution of the test statistic $T$, as on many other occasions, is also obtained by Bootstrap.


More generally, Roca-Pardinas et al. (2005) consider the problem of testing for interactions in Generalized Additive Models (GAM). The testing problem is formulated as follows:

$$H_0: m(X) = H\Big(c + \sum_{d=1}^{p} m_d(X_d) + \sum_{1 \le i \le j \le p} m_{i,j}(X_i, X_j)\Big).$$


The authors study the case of a logistic GAM, that is,

$$\eta_X = \log\!\left(\frac{p(X)}{1 - p(X)}\right) = c + \sum_{d=1}^{p} m_d(X_d) + \sum_{1 \le i \le j \le p} m_{i,j}(X_i, X_j),$$

and use a likelihood ratio test:

$$T = \sum_{i=1}^{n} \Big(e\big(\hat p^{(d_1^0, d_2^0)}(X_i), Y_i\big) - e\big(\hat p(X_i), Y_i\big)\Big),$$

where $\hat p^{(d_1^0, d_2^0)}$ and $\hat p$ denote the estimators under the null and the alternative hypotheses, with

$$e(p, y) = -2\big(y \log p + (1 - y)\log(1 - p)\big).$$

The estimation procedure is based on local scoring with Backfitting, and the calibration of the distribution of $T$ is done by Bootstrap.
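Once fitted probabilities are available under both hypotheses, the likelihood ratio statistic is a difference of binomial deviances. The sketch below takes the two vectors of fitted probabilities as given (obtaining them by local scoring is not shown) and clips them away from 0 and 1 as a numerical safeguard, which is an assumption of this example; the function names are hypothetical.

```python
import numpy as np

def deviance_term(p, y):
    """e(p, y) = -2 (y log p + (1 - y) log(1 - p)), evaluated elementwise."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)     # guard against log(0)
    return -2.0 * (y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def lr_statistic(p_null, p_alt, y):
    """T = sum_i [ e(p_null_i, Y_i) - e(p_alt_i, Y_i) ]."""
    return np.sum(deviance_term(p_null, y) - deviance_term(p_alt, y))
```

For example, with responses `y = [1, 0]`, null fit `[0.5, 0.5]` and alternative fit `[0.8, 0.2]`, the statistic equals $4\log 1.6$.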


Ethanol dataset (p-value: 0.00)

[Figure: two panels, "No interaction" and "Interaction", each showing NOx against C and E.]


Boston Housing dataset

[Figure: two panels, "No interaction" and "Interaction", each showing medv against rm and lstat.]


Marginal estimations for Boston Housing (p-value: 0.01).

rm: average number of rooms.

[Figure: lo(X1) against X1.]

lstat: percentage of lower status of the population.

[Figure: lo(X4) against X4.]


Naphthalene dataset (additive model!)

[Figure: two panels, "No interaction" and "Interaction", each showing Yn against AN and Ctime.]


Goodness-of-fit for regressions models with incomplete data.

The case of censored and/or truncated data using smoothing techniques

In general, we observe a random vector $(X, T, Z, \delta)$, where $X$ is the covariate, $T$ is the left truncation time and $Z$ is the observed lifetime, defined as $Z = \min\{Y, C\}$, where $Y$ is the true lifetime and $C$ is the right censoring time. The indicator $\delta = 1_{\{Y \le C\}}$ records whether or not the datum is censored. We are only able to observe the whole vector when $T \le Z$.


In Cao and González-Manteiga (2008), different models are tested considering an initial sample $\{(X_i, Z_i, T_i, \delta_i)\}_{i=1}^{n}$.

a.1) General Regression (GR) and polynomial regression (PR) models:

\[
H_0 : T(F(\cdot|X)) = A^t(X)\theta,
\]

where $A : \mathbb{R}^q \to \mathbb{R}^{p+1}$ is a known function, $\theta = (\theta_0, \ldots, \theta_p)^t \in \mathbb{R}^{p+1}$ and

\[
T(N) = \int_0^1 N^{-1}(s) J(s)\,ds,
\]

for any distribution $N$, with $N^{-1}(s) = \inf\{u \mid N(u) \ge s\}$ the quantile function. $J$ is a nonnegative real function satisfying $\int_0^1 J(s)\,ds = 1$, and $F(\cdot|x)$ is the conditional distribution of $Y | X = x$.


Observe that if $J$ is the uniform density:

\[
T(F(\cdot|x)) = E(Y | X = x) = A^t(x)\theta,
\]

the null hypothesis considered in Section 2. Besides, if $q = 1$ and $A(x) = (1, x, \ldots, x^p)^t$, we have the PR model.
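The reduction to the mean when J is uniform can be checked numerically. The sketch below is only an illustration on an unconditional empirical distribution (no covariates, no censoring or truncation); the helper names `quantile` and `T_functional` and the midpoint-rule grid are assumptions of this example.

```python
import math

def quantile(sample_sorted, s):
    """Generalized inverse N^{-1}(s) = inf{u : N(u) >= s} of the empirical cdf."""
    n = len(sample_sorted)
    k = max(1, math.ceil(s * n - 1e-12))  # smallest k with k/n >= s
    return sample_sorted[k - 1]

def T_functional(sample, J, grid=10000):
    """Midpoint-rule approximation of T(N) = int_0^1 N^{-1}(s) J(s) ds."""
    xs = sorted(sample)
    total = 0.0
    for i in range(grid):
        s = (i + 0.5) / grid
        total += quantile(xs, s) * J(s)
    return total / grid

sample = [2.0, 5.0, 1.0, 4.0]
uniform = lambda s: 1.0                  # J = uniform density on (0, 1)
t_val = T_functional(sample, uniform)    # agrees with the sample mean, 3.0
```

Other choices of J put more weight on particular quantile ranges, for instance a J concentrated around s = 0.5 yields a functional close to the median.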


a.2) Proportional hazard (PH) models:

\[
H_0 : \lambda(t|x) = \lambda_0(t) \exp(A^t(x)\theta),
\]

where $\lambda(\cdot|x)$ is the conditional hazard rate and $\lambda_0(\cdot)$ the so-called baseline hazard rate (this is the well-known Cox regression model).

a.3) Additive risks (AR) models:

\[
H_0 : \lambda(t|x) = \lambda_0(t) + A^t(x)\theta.
\]


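a.4) Proportional odds (PO) models:

\[
H_0 : \operatorname{logit}\big(1 - \exp(-\Delta(t|x))\big) = \log \frac{P(Y \le t | X = x)}{P(Y > t | X = x)} = \alpha_0(t) + A^t(x)\theta,
\]

where $\alpha_0(\cdot)$ is an increasing function and $\operatorname{logit}(u) = \log\left( \frac{u}{1 - u} \right)$.

Here, $\Delta(\cdot|X)$ denotes the cumulative hazard rate.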


All these models have been studied in detail by Grigoletto and Akritas (1999). For example, for the simple case $q = 1$ and $A(x) = (1, x, \ldots, x^p)^t$, a general statistic for the different hypotheses can be given by:

\[
D_n = \operatorname*{arg\,min}_{\theta} \frac{1}{n} \sum_{r=1}^{n} \big( \hat\Omega_r - (\theta_0 + \theta_1 X_r + \ldots + \theta_p X_r^p) \big)^2,
\]

where $\hat\Omega_r$ is an estimator of $\Omega_r$, a suitable transformation of $\Delta(\cdot|X)$.


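For (a.1), we must take $T(F(\cdot|x))$; for (a.2), we must take

\[
\int_0^\infty \log(\Delta(s|x))\,dW(s),
\]

with $W$ a nonnegative weight function and $\int_0^1 dW(s) = 1$; for (a.3), the transformation is

\[
\int_0^\infty \Delta(s|x)\,d\tilde W(s), \qquad \tilde W(s) = \frac{W(s)}{\int_0^\infty u\,dW(u)}.
\]

Finally, for (a.4):

\[
\int_0^\infty \operatorname{logit}\big(1 - \exp(-\Delta(s|x))\big)\,dW(s).
\]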


These nonparametric estimators of $\Omega(\cdot|x)$ are obtained using a nonparametric estimation of $F(\cdot|x)$. In the case of censored and truncated data, this estimator was obtained by Iglesias-Pérez and González-Manteiga (1999). In Cao and González-Manteiga (2008), general limit theory is derived for the test $n\sqrt{h}\,D_n$, with additional bootstrap calibration. In the complete data case, for the transformation of $\Delta(\cdot|x)$ given by $T(F(\cdot|x)) = E(Y | X = x)$, the test of Cao and González-Manteiga (1993) is obtained.


The case of missing data using smoothing techniques

In González-Manteiga and Pérez-González (2006), a goodness-of-fit test adapted to the situation where the $Y$ variable may be missing is proposed. Consider the general model:

\[
Y = m(X) + \sigma(X)\varepsilon.
\]

In the complete data context, we observe a sample $\{(X_i, Y_i)\}_{i=1}^{n}$ of $(X, Y) \in \mathbb{R}^{p+1}$. In the missing data case, we may not observe $Y_i$ for some index $i$, which implies that we have to deal with $(X_i, Y_i)$ if $Y_i$ is observed and $(X_i, \cdot)$ otherwise. To record whether an observation is complete or not, a new variable $\delta$ is introduced: $\delta_i = 1$ if $Y_i$ is observed, and $\delta_i = 0$ otherwise.


Under the assumption of missing at random:

\[
P(\delta = 1 | Y, X) = P(\delta = 1 | X) = p(X), \quad X \in \mathbb{R}^p,
\]

limit distributions have been obtained in González-Manteiga and Pérez-González (2006) for the test statistics:

\[
T_{n,S} = n |H|^{1/4} \int \big( \hat m_{S,H}(x) - m_{\hat\theta}(x) \big)^2 \omega(x)\,dx,
\]

and

\[
T_{n,I} = n |H|^{1/4} \int \big( \hat m_{I,H,G}(x) - m_{\hat\theta}(x) \big)^2 \omega(x)\,dx
\]

under the model $m(x) = \theta_0 + \theta_1^t x + c_n S(x)$ (Pitman alternatives). $\hat m_{S,H}$ is the multidimensional local linear regression estimator with the completely observed data and bandwidth matrix $H$; $\hat m_{I,H,G}$ is the nonparametric estimator where the missing $Y$ data are imputed nonparametrically with bandwidth $G$. The function $S$ is orthogonal to $M_\Theta = \{ m_\theta(x) = \theta_0 + \theta_1^t x \}_{\theta = (\theta_0, \theta_1)^t \in \Theta}$.


In the particular case of complete data and $p = 1$, it can be seen that $c_n = (n |H|^{1/4})^{-1/2} = (n h^2)^{-1/2}$ and we would have the test given by Alcalá et al. (1999).

It is necessary to point out that the behaviour of the tests depends on the choice of the smoothing parameters. The behaviour of

\[
\alpha_n = \frac{|G|^{1/2}}{|H|^{1/2}} \to \alpha
\]

is crucial to decide whether imputation is worthwhile or not. When $\alpha = \infty$, we have oversmoothing (large bias effect) in the imputation, and the behaviour of the imputed test could then be worse. When $\alpha = 0$, both tests are equivalent. Imputation could improve the performance of the test for $\alpha \in (0, \infty)$.


Tests in time series

Testing the trend with smoothing methods

In the (general) dependent data context, $\{(X_t, Y_t)\}_{1 \le t \le T}$ is a sequence of observations from a joint stationary density function $f(x, y)$ corresponding to $(X, Y)$, a $(d + 1)$-dimensional random vector. Apart from the joint density $f(x, y)$, there are also other important functions for describing the behaviour of $(X, Y)$.

For instance, the marginal density of $X$, namely $\pi(x)$; the conditional density of $Y$ given $X = x$, denoted by $f(y|x)$; and the conditional moments $m_j(x) = E(Y^j | X = x)$, for $j = 1, 2, \ldots$. For $j = 1$, we have the conditional expectation, and $m_1$ is simply denoted by $m$.


Consider the model $Y_t = m(X_t) + \varepsilon_t$, $t = 1, \ldots, T$, where $\{\varepsilon_t\}$ is an i.i.d. sequence with $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^2) = \sigma^2 < \infty$, and $\{X_t\}$ is strictly stationary. If we are interested in testing the null hypothesis that $m$ belongs to a certain parametric family:

\[
H_0 : m \in \{ m_\theta, \theta \in \Theta \},
\]

it is possible to generalize to this context the tests introduced by Härdle and Mammen (1993) and Zheng (1996), among others, which have already been described for the i.i.d. case.


For instance, the generalization of Zheng's test can be carried out as follows (detailed computations can be found in Gao (2007)). Consider the test statistic:

\[
d_2(m, H_0) = \frac{h^{d/2}}{T} \sum_{s=1}^{T} \sum_{t=1}^{T} e_s K_h(X_s - X_t) e_t = L_T(h),
\]

with $e_t = Y_t - m_{\hat\theta}(X_t)$.


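The limit distribution of this test statistic can be given in terms of a functional similar to:

\[
R_T(h) = \sum_{s=1}^{T} \sum_{t=1}^{T} \varepsilon_s \phi_T(X_s, X_t) \varepsilon_t
= \sum_{s=1}^{T} \phi_T(X_s, X_s) \varepsilon_s^2 + \sum_{s \ne t} \phi_T(X_s, X_t) \varepsilon_s \varepsilon_t,
\]

with $\varepsilon_t = Y_t - m_\theta(X_t)$, where $\phi_T(\cdot)$ is a function depending on $T$, $h$ and $K(\cdot)$.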


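For instance, Zheng's test can be written as (approximately):

\[
d_2(m, H_0) \approx \frac{1}{T h^{d/2}} \sum_{s=1}^{T} \sum_{t=1}^{T} \varepsilon_s K\left( \frac{X_s - X_t}{h} \right) \varepsilon_t = Q_T(h).
\]

Under some regularity conditions (see Gao (2007), pp. 72-74), assuming that $\{X_t\}$ is $\alpha$-mixing and that $X_s$ and $\varepsilon_t$ are independent for $s \le t$, it can be proved that:

\[
\sup_{x \in \mathbb{R}} \left| P\left( \frac{Q_T(h) - E(Q_T(h))}{\sigma_T} \le x \right) - \Phi(x) + \kappa_T (x^2 - 1) \phi(x) \right| \le C h^d
\]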


where

\[
\sigma_T^2 = (\mu_4 - \mu_2^2) \frac{K^2(0)}{T h^d} + 2 \mu_2^2 \nu_2 \int K^2(u)\,du,
\]

\[
\kappa_T = \frac{h^{d/2}}{\sigma_T^3} \left( \frac{\mu_3^2 K^2(0)}{T h^d} + \frac{4 \mu_2^3 \nu_3 K^{(3)}(0)}{3} \right),
\]

with $\mu_k = E(\varepsilon_1^k)$, $\nu_l = E(\pi^l(X_1))$, $K^{(3)}$ denoting the three-fold convolution of $K$, and $C > 0$.


Hence, for a significance level $\alpha$, the null hypothesis is rejected if $L_T(h) \ge l^*_\alpha$, where $l^*_\alpha$ is the bootstrap $(1 - \alpha)$-quantile, that is:

\[
P^*(L^*_T(h) \ge l^*_\alpha) = \alpha.
\]

See also Li and Wang (1998), Kreiss et al. (2002) and Franke et al. (2002) for other general resampling methods.


It can also be proved (see Gao (2007)) that, under $H_0$,

\[
\sup_{x \in \mathbb{R}} \left| P^*(L^*_T(h) \le x) - P(L_T(h) \le x) \right| = O(h^{d/2})
\]

and $P(L_T(h) \ge l^*_\alpha) = \alpha + O(h^{d/2})$, whereas $P(L_T(h) > l^*_\alpha) \to 1$ under $H_1$ (alternative hypothesis).


As well as the generalization of Zheng's test, other procedures such as the Härdle and Mammen (1993) test can also be generalized. Besides, other hypotheses may also be tested, such as testing for subset regression, testing in partially linear models, in additive models, etc.


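In addition, when $h \to 0$ and $T h^d \to \infty$:

\[
\sup_{x \in \mathbb{R}} \left| P\left( \frac{Q_T(h) - E(Q_T(h))}{\sigma_T} \le x \right) - \Phi(x) + \kappa_T (x^2 - 1) \phi(x) \right| \longrightarrow 0.
\]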


In order to calibrate the distribution of $L_T(h)$, a studentized version

\[
\bar L_T(h) = \frac{L_T(h) - E(L_T(h))}{\sqrt{\mathrm{Var}(L_T(h))}}
\]

is considered, and it can be proved that $\bar L_T(h) = \hat L_T(h) + o_P(1)$ for each $h$, where

\[
\hat L_T(h) = \frac{\sum_{s \ne t} e_s K_h(X_s - X_t) e_t}{\sqrt{2 \sum_{s=1}^{T} \sum_{t=1}^{T} e_s^2 K_h^2(X_s - X_t) e_t^2}}.
\]


With this formulation, it is easy to check that the studentized statistic $\hat L_T(h)$ is invariant with respect to $\sigma^2$ and its distribution can be approximated by bootstrap. For that purpose, bootstrap samples are obtained from $Y^*_t = m_{\hat\theta}(X_t) + e^*_t$, where $E(e^*_t) = 0$ and $\mathrm{Var}(e^*_t) = 1$. With the artificial data $\{(X_t, Y^*_t)\}_{1 \le t \le T}$, a bootstrap version of $\hat L_T(h)$ is obtained as:

\[
\hat L^*_T(h) = \frac{\sum_{s \ne t} \hat e^*_s K_h(X_s - X_t) \hat e^*_t}{\sqrt{2 \sum_{s=1}^{T} \sum_{t=1}^{T} \hat e^{*2}_s K_h^2(X_s - X_t) \hat e^{*2}_t}},
\]

with $\hat e^*_s = Y^*_s - m_{\hat\theta^*}(X_s)$. The distribution of $\hat L^*_T(h)$, that is:

\[
P^*(\hat L^*_T(h) \le x) = P\big( \hat L^*_T(h) \le x \,\big|\, (X_1, \ldots, X_T) \big),
\]

is approximated by Monte Carlo resampling.
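The calibration just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: d = 1, a linear null model fitted by least squares, a Gaussian kernel, standard normal bootstrap errors, and simulated data; none of these choices (nor the function names) come from the source.

```python
import math
import random

def fit_line(x, y):
    """Least-squares fit of the assumed null model m_theta(x) = theta0 + theta1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

def studentized_stat(x, y, h):
    """sum_{s != t} e_s K_h e_t divided by sqrt(2 sum_{s,t} e_s^2 K_h^2 e_t^2)."""
    theta0, theta1 = fit_line(x, y)
    e = [b - (theta0 + theta1 * a) for a, b in zip(x, y)]
    num = den = 0.0
    for s in range(len(x)):
        for t in range(len(x)):
            # Gaussian kernel K_h(u) = K(u / h) / h (an assumption of this sketch)
            k = math.exp(-0.5 * ((x[s] - x[t]) / h) ** 2) / (h * math.sqrt(2 * math.pi))
            term = e[s] * k * e[t]
            if s != t:
                num += term
            den += term ** 2
    return num / math.sqrt(2 * den)

def bootstrap_pvalue(x, y, h, n_boot=100, seed=0):
    """Monte Carlo approximation of the bootstrap p-value."""
    rng = random.Random(seed)
    theta0, theta1 = fit_line(x, y)
    observed = studentized_stat(x, y, h)
    exceed = 0
    for _ in range(n_boot):
        # Y*_t = m_theta_hat(X_t) + e*_t with E(e*) = 0 and Var(e*) = 1
        y_star = [theta0 + theta1 * a + rng.gauss(0.0, 1.0) for a in x]
        if studentized_stat(x, y_star, h) >= observed:  # theta* is refitted inside
            exceed += 1
    return observed, exceed / n_boot

# Simulated data with a quadratic trend, so the linear null is false.
rng = random.Random(42)
x = [rng.uniform(-2.0, 2.0) for _ in range(40)]
y = [a * a + 0.3 * rng.gauss(0.0, 1.0) for a in x]
stat, p_value = bootstrap_pvalue(x, y, h=0.5)
```

Because numerator and denominator scale identically with the residuals, rescaling the responses leaves the statistic unchanged, which is why unit-variance bootstrap errors suffice.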


Testing the trend with empirical processes.

The empirical regression process introduced before can be generalized to a more general framework as follows. Consider $Y_t \in \mathbb{R}$ and $X_t = (Y_{t-1}, \ldots, Y_{t-s}, Z_t)$, with $Z_t$ a $p$-dimensional random vector. For $X_t = Y_{t-1}$, Koul and Stute (1999) studied the problem of testing linearity.

The same problem is also tackled by Domínguez and Lobato (2003) taking $X_t = (Y_{t-1}, \ldots, Y_{t-s})$, and in this same case, Stute et al. (2006) introduced a test for the link of a linear model.


In Escanciano (2007), a nonlinear model for $m(x) = E(Y | X = x)$ is tested. The author introduces a general empirical regression process given by:

\[
R_{n,\omega}(x, \theta) = n^{-1/2} \sum_{t=1}^{n} (Y_t - m(X_t, \theta))\,\omega(X_t, x).
\]

Some particular cases taking $\omega(X_t, x) = I(X_t \le x)$ ($I$ denotes the indicator function), or $\omega(X_t, x) = I(\beta^t X_t \le x)$, are also studied.


A Gaussian limit distribution for $R_{n,\omega}$ is obtained, and the asymptotic distribution of $R_{n,\omega}$ can be approximated with a wild bootstrap process:

\[
R^*_{n,\omega}(x, \theta^*) = n^{-1/2} \sum_{t=1}^{n} (Y^*_t - m_{\theta^*}(X_t))\,\omega(X_t, x),
\]

where $Y^*_t = m_{\hat\theta}(X_t) + (Y_t - m_{\hat\theta}(X_t)) V_t$, with $V_t$ i.i.d. with zero mean and unit variance. The bootstrap regression parameter $\theta^*$ is estimated with $\{(X_t, Y^*_t)\}_{1 \le t \le n}$.

Some other references on this topic are Escanciano (2006), Escanciano (2009) and Escanciano and Velasco (2006a, 2006b).
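A minimal sketch of the wild bootstrap scheme above, for the indicator weight ω(X_t, x) = I(X_t ≤ x) with scalar X_t. The linear null model, the Rademacher multipliers V_t, the Cramér-von Mises-type summary of the process and the simulated i.i.d. data are all assumptions made for this illustration.

```python
import random

def fit_line(x, y):
    """Least-squares fit of the assumed null model m_theta(x) = theta0 + theta1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

def marked_process(x, resid):
    """R(x_j) = n^{-1/2} sum_t resid_t * I(X_t <= x_j), evaluated at the sample points."""
    n = len(x)
    return [sum(r for xt, r in zip(x, resid) if xt <= point) / n ** 0.5 for point in x]

def cvm_stat(x, y):
    """Cramer-von Mises-type functional of the residual marked process."""
    theta0, theta1 = fit_line(x, y)
    resid = [b - (theta0 + theta1 * a) for a, b in zip(x, y)]
    return sum(v * v for v in marked_process(x, resid)) / len(x)

def wild_bootstrap_pvalue(x, y, n_boot=200, seed=0):
    rng = random.Random(seed)
    theta0, theta1 = fit_line(x, y)
    fitted = [theta0 + theta1 * a for a in x]
    resid = [b - f for b, f in zip(y, fitted)]
    observed = cvm_stat(x, y)
    exceed = 0
    for _ in range(n_boot):
        # Y*_t = m_theta_hat(X_t) + resid_t * V_t, V_t = +/-1 (mean 0, variance 1)
        y_star = [f + r * rng.choice((-1.0, 1.0)) for f, r in zip(fitted, resid)]
        if cvm_stat(x, y_star) >= observed:  # cvm_stat re-estimates theta* from (X_t, Y*_t)
            exceed += 1
    return observed, exceed / n_boot

rng = random.Random(7)
x = [rng.uniform(0.0, 1.0) for _ in range(50)]
y = [1.0 + 2.0 * a + 0.2 * rng.gauss(0.0, 1.0) for a in x]
stat, p_value = wild_bootstrap_pvalue(x, y)
```

Multiplying each residual by its own V_t preserves possible conditional heteroscedasticity in the bootstrap samples, which is the usual motivation for the wild scheme.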


Testing the covariance with spectral methods

In the time series setting, Paparoditis (2000) and Fan and Zhang (2004) proposed goodness-of-fit tests for the spectral density. Consider $\{X_t, t = 1, \ldots, N\}$, a stationary time series with zero mean and autocovariance function

\[
\gamma(u) = \mathrm{Cov}(X_t, X_{t+u}).
\]

The Fourier transform of the autocovariance function is the spectral density, so testing a model for the covariance function is equivalent to testing a model for the spectral density, or the log-spectral density.


A well-known nonparametric estimator for the spectral density is the periodogram:

\[
I(\omega_k) = \frac{1}{2\pi N} \left| \sum_{t=1}^{N} X_t e^{-i t \omega_k} \right|^2, \qquad \omega_k = 2\pi k / N \quad (k = 1, \ldots, n = [(N - 1)/2]),
\]

where $\omega_k$ denote the Fourier frequencies. When $X_t$ can be represented as a linear sequence, the periodogram can be written as:

\[
I(\omega_k) = f(\omega_k) V_k + R_{n,k},
\]

where $f(\omega_k)$ is the spectral density of the process at the Fourier frequency $\omega_k$, $V_k$ are independent exponentially distributed random variables and $R_{n,k}$ is an asymptotically negligible term.
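The periodogram definition can be evaluated directly, as in the following sketch. It uses a plain O(N²) sum rather than an FFT for transparency, and the cosine test series is an illustrative choice, not data from the source.

```python
import cmath
import math

def periodogram(x):
    """Return [(omega_k, I(omega_k))] for k = 1, ..., [(N - 1) / 2]."""
    N = len(x)
    n = (N - 1) // 2
    out = []
    for k in range(1, n + 1):
        w = 2 * math.pi * k / N
        # sum_{t=1}^{N} X_t exp(-i t omega_k); enumerate is 0-based, hence t + 1
        dft = sum(xt * cmath.exp(-1j * (t + 1) * w) for t, xt in enumerate(x))
        out.append((w, abs(dft) ** 2 / (2 * math.pi * N)))
    return out

N = 32
series = [math.cos(2 * math.pi * 3 * t / N) for t in range(1, N + 1)]
pgram = periodogram(series)
peak = max(pgram, key=lambda pair: pair[1])  # located at omega_3 = 2*pi*3/N
```

For this pure cosine the whole mass sits at the third Fourier frequency, where the ordinate equals N/(8π); taking logarithms of the ordinates of a real series gives the responses of the log-periodogram regression.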


Taking logarithms in the previous expression of the periodogram, we obtain:

\[
Y_k = m(\omega_k) + z_k + r_k,
\]

where:

$Y_k$ is the log-periodogram at $\omega_k$,

$m(\cdot) = \log f(\cdot)$ is the log-spectral density of $X_t$,

$z_k$ are i.i.d. $\log(\mathrm{Exp}(1))$,

$r_k = \log\left( 1 + \frac{R_{n,k}}{f(\omega_k) V_k} \right)$.


The goal is to test whether or not the spectral density of an observed time series belongs to a certain parametric family. The problem can be formulated as testing the hypothesis:

\[
H_0 : f(\cdot) = f_\theta(\cdot) \quad \text{vs.} \quad H_a : f(\cdot) \ne f_\theta(\cdot),
\]

which is asymptotically equivalent to

\[
H_0 : m(\cdot) = m_\theta(\cdot) \quad \text{vs.} \quad H_a : m(\cdot) \ne m_\theta(\cdot).
\]

Paparoditis (2000) considers the previous model and proposes a test based on an integrated squared deviation criterion:

\[
T_P = N h^{1/2} \int_{-\pi}^{\pi} \left( \frac{1}{N h} \sum_{k=-n}^{n} K\left( \frac{\omega_k - \lambda}{h} \right) \left( \frac{I(\omega_k)}{f_{\hat\theta}(\omega_k)} - 1 \right) \right)^2 d\lambda,
\]

where $K$ is a kernel function and $h$ is the bandwidth parameter.


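This statistic is asymptotically normally distributed, with a mean that does not depend on the model:

\[
\mu_h = h^{-1/2} \int_{-\pi}^{\pi} K^2(u)\,du,
\]

and variance

\[
\sigma^2 = \frac{1}{\pi} \int_{-2\pi}^{2\pi} \left( \int_{-\pi}^{\pi} K(u) K(u + x)\,du \right)^2 dx.
\]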


The test proposed in Fan and Zhang (2004) considers the representation for the log-periodogram. This test is based on a generalized likelihood ratio test, introduced in Fan et al. (2001). The log-likelihood function associated with the previous model is:

\[
\sum_{k=1}^{n} \left( Y_k - m(\omega_k) - e^{Y_k - m(\omega_k)} \right).
\]

In order to construct a generalized likelihood ratio test, two estimations of the log-spectral density (parametric and nonparametric approaches) must be considered.


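Parametric approach:

Choose $\hat\theta$, the maximizer of:

\[
\sum_{k=1}^{n} \left( Y_k - m_\theta(\omega_k) - e^{Y_k - m_\theta(\omega_k)} \right).
\]

(Equivalently, $\hat\theta$ minimizes Whittle's negative log-likelihood.)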


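Nonparametric approach:

For any $x$, approximate $m(\omega_k)$ locally by a linear function $a + b(\omega_k - x)$ and maximize the local log-likelihood:

\[
\sum_{k=1}^{n} \left( Y_k - a - b(\omega_k - x) - e^{Y_k - a - b(\omega_k - x)} \right) K_h(\omega_k - x).
\]

The local maximum likelihood estimator $\hat m_{LK}(x)$ of $m(x)$ is $\hat a$, the first component of the maximizer $(\hat a, \hat b)$.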


Generalized likelihood ratio test statistic

\[
T_{LK} = \sum_{k=1}^{n} \left( e^{Y_k - m_{\hat\theta}(\omega_k)} + m_{\hat\theta}(\omega_k) - e^{Y_k - \hat m_{LK}(\omega_k)} - \hat m_{LK}(\omega_k) \right).
\]

This statistic is asymptotically normally distributed. In practice, a parametric bootstrap approach is used in order to compute the corresponding p-values.
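The statistic is the difference between the log-likelihood at the nonparametric fit and at the parametric fit, which the following sketch makes explicit. The log-periodogram ordinates and the two sets of fitted values are hypothetical numbers; neither the Whittle fit nor the local likelihood fit is carried out here.

```python
import math

def log_lik(Y, m):
    """sum_k ( Y_k - m(omega_k) - exp(Y_k - m(omega_k)) )."""
    return sum(y - mk - math.exp(y - mk) for y, mk in zip(Y, m))

def glr_stat(Y, m_par, m_np):
    """T_LK = sum_k ( e^{Y_k - m_par} + m_par - e^{Y_k - m_np} - m_np )."""
    return sum(math.exp(y - a) + a - math.exp(y - b) - b
               for y, a, b in zip(Y, m_par, m_np))

# Hypothetical log-periodogram ordinates and fitted log-spectral densities.
Y = [0.2, -1.1, 0.5, -0.3]
m_par = [0.0, 0.0, 0.0, 0.0]      # parametric fit (here: a constant spectrum)
m_np = [0.3, -0.8, 0.4, -0.2]     # nonparametric (local likelihood) fit
t_lk = glr_stat(Y, m_par, m_np)
```

The terms Y_k cancel in the difference, so T_LK coincides with log_lik(Y, m_np) - log_lik(Y, m_par).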


Tests in spatial data

Spatial data

In the spatial statistics setting, as well as in time series analysis, the dependence structure can be modeled by the covariogram or the variogram, depending on the stationarity properties of the process. Consider $\{Z(s), s \in D \subset \mathbb{R}^d\}$. Usually, some stationarity assumptions are required.

Second-order stationarity: $E(Z(s)) = \mu$ for all $s \in D$, and $\mathrm{Cov}(Z(s_1), Z(s_2)) = C(s_1 - s_2)$ for all $s_1, s_2 \in D$. ($C$ is the covariogram or autocovariance function.)

Intrinsic stationarity: $E(Z(s)) = \mu$ for all $s \in D$, and $\mathrm{Var}(Z(s_1) - Z(s_2)) = 2\gamma(s_1 - s_2)$ for all $s_1, s_2 \in D$. ($2\gamma$ is the variogram.)


The variogram is used to measure the dependence structure of a spatial process. When working with a second-order stationary process, we can use the relationship:

\[
2\gamma(s) = 2(C(0) - C(s)).
\]

Consider $Z(s_1), \ldots, Z(s_n)$, $s_j \in D \subset \mathbb{R}^2$. Many different techniques have been proposed in order to estimate the variogram. Nevertheless, goodness-of-fit testing techniques for assessing whether a variogram model is appropriate for describing the dependence structure of a dataset have not been developed thoroughly.


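Testing independence (Diblasi and Bowman, 2001)

\[
H_0 : \gamma(\cdot) = \sigma^2 \quad \text{vs.} \quad H_a : \gamma(\cdot) \ne \sigma^2
\]

Consider the following set of pairs:

\[
\big\{ \big( |s_i - s_j|, |Z(s_i) - Z(s_j)|^{1/2} \big) \big\} = \{ (h_{ij}, d_{ij}) \}
\]

Denote by $\bar d$ the mean value of the $d_{ij}$, $i < j$. Estimate the variogram as a weighted sum of the quantities above:

\[
\hat\gamma(h) = \sum_{i<j} w_{ij} d_{ij}
\]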


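A particularly simple form of smoothing is defined by a local mean, with weights:

\[
w_{ij} \propto \frac{1}{b} K\left( \frac{h - h_{ij}}{b} \right)
\]

The test statistic:

\[
T = \frac{\sum_{i<j} (d_{ij} - \bar d)^2 - \sum_{i<j} (d_{ij} - \hat d_{ij})^2}{\sum_{i<j} (d_{ij} - \hat d_{ij})^2},
\]

where $\hat d_{ij} = \hat\gamma(h_{ij})$ denotes the smoothed value at distance $h_{ij}$.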
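The construction of the statistic from the pairs (h_ij, d_ij) can be sketched as follows. The Gaussian kernel, the small regular grid of sites and the deterministic toy field are assumptions of this illustration, and the reference distribution needed to obtain a p-value is not computed.

```python
import math

def diblasi_bowman_stat(sites, z, b):
    """T = (RSS_0 - RSS_1) / RSS_1 with a local-mean smooth of d_ij against h_ij."""
    n = len(sites)
    h, d = [], []
    for i in range(n):
        for j in range(i + 1, n):
            h.append(math.dist(sites[i], sites[j]))   # h_ij = |s_i - s_j|
            d.append(math.sqrt(abs(z[i] - z[j])))     # d_ij = |Z(s_i) - Z(s_j)|^{1/2}
    d_bar = sum(d) / len(d)

    def local_mean(h0):
        # Weights proportional to K((h0 - h_ij) / b), Gaussian K, normalized to sum to one
        w = [math.exp(-0.5 * ((h0 - hk) / b) ** 2) for hk in h]
        return sum(wk * dk for wk, dk in zip(w, d)) / sum(w)

    d_hat = [local_mean(hk) for hk in h]
    rss0 = sum((dk - d_bar) ** 2 for dk in d)                 # fit under H0 (flat)
    rss1 = sum((dk - fk) ** 2 for dk, fk in zip(d, d_hat))    # nonparametric fit
    return (rss0 - rss1) / rss1

# Toy data on a 4 x 4 grid (hypothetical values).
sites = [(float(i), float(j)) for i in range(4) for j in range(4)]
z = [math.sin(1.3 * i) + 0.5 * math.cos(2.1 * j) for i, j in sites]
t_stat = diblasi_bowman_stat(sites, z, b=0.5)
```

With a very large bandwidth the local mean collapses to the overall mean of the d_ij and the statistic approaches zero, which matches the reading of T as a comparison between a flat fit and a smooth fit.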


A real dataset:

[Figure: square-root difference against distance.]

Coalash data in mining samples.

Square-root absolute value scale.

Pointwise confidence intervals.

$H_0 : \gamma(\cdot) = \sigma^2$


Testing a valid model for the variogram (Diblasi and Maglione, 2005)

It is a generalization of the Diblasi and Bowman test for independence. Testing hypotheses:

\[
H_0 : \gamma(\cdot) = \gamma_0(\cdot) \quad \text{vs.} \quad H_a : \gamma(\cdot) \ne \gamma_0(\cdot)
\]

Denote by:

\[
R_{ij} = Z(s_j) - Z(s_i)
\]

The variables $|R_{ij}|^{1/2}$ are frequently used to analyze the correlation of the variables $Z(s)$ in an exploratory approach.


They are resistant to the influence of outliers and approximately normally distributed. Transform these variables to have zero mean, and rewrite the indexes $(i, j)$ as $k$:

\[
S_k = |R_k|^{1/2} - E_0(|R_k|^{1/2}),
\]

where the indexes $k$ are obtained by a bijection between the set of all pairs of indexes $(i, j)$ of observed locations $(s_i, s_j)$ with $i \ne j$ and the set of positive integers $\{1, 2, \ldots, n(n - 1)/2\}$. This index transformation also affects the $R_{ij}$, which can be written as $R_k$, and we denote $h_k = s_i - s_j$.


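Define the smoothed variables

\[
\hat S_k = \sum_{r=1}^{N} w_r(h_k) S_r,
\]

where the weights are chosen as:

\[
w_r(h_k) = \frac{\exp\left( -\left( \frac{h_k - h_r}{b} \right)^2 \right)}{\sum_{l=1}^{N} \exp\left( -\left( \frac{h_k - h_l}{b} \right)^2 \right)}.
\]

The test statistic:

\[
T = \frac{\sum_{k=1}^{N} \big( S_k - a\gamma_0(h_k)^{1/4} \big)^2 - \sum_{k=1}^{N} (S_k - \hat S_k)^2}{\sum_{k=1}^{N} (S_k - \hat S_k)^2}
\]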


Testing a valid family for the covariogram

We want to test whether the variogram/covariogram of an observed spatial process belongs to a certain parametric family. For second-order stationary processes, this problem can be rewritten in terms of the spatial spectral density.

As in the time series context, if the process is second-order stationary, then the covariogram admits a Fourier transform. This Fourier transform is the spatial spectral density:

$$f(\boldsymbol{\omega}) = \frac{1}{(2\pi)^2} \int\!\!\int C(s)\, e^{-i s^T \boldsymbol{\omega}}\, ds.$$


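For data observed on a regular grid $n_1 \times n_2$, the spatial periodogram is defined as:

$$I(\boldsymbol{\omega}_k) = \frac{1}{(2\pi)^2} \cdot \frac{1}{N} \left| \sum_{s_1=1}^{n_1} \sum_{s_2=1}^{n_2} Z(s)\, e^{-i s^T \boldsymbol{\omega}_k} \right|^2,$$

and it is usually computed at the set of bidimensional Fourier frequencies:

$$\boldsymbol{\omega}_k = (\omega_{k_1}, \omega_{k_2}), \qquad \omega_{k_j} = \frac{2\pi k_j}{n_j}, \qquad k_j = 0, 1, \ldots, [(n_j - 1)/2], \quad j = 1, 2.$$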
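The periodogram at the Fourier frequencies can be obtained from a two-dimensional FFT. The sketch below is an illustration, not the authors' implementation. It assumes that N is the number of grid points, n1 * n2, and the function name `spatial_periodogram` is hypothetical. NumPy's `fft2` sums over indices 0, ..., n - 1 rather than 1, ..., n; this shift multiplies the transform by a unit-modulus factor and leaves the squared modulus unchanged.

```python
import numpy as np


def spatial_periodogram(z):
    """Periodogram of an n1 x n2 grid at the Fourier frequencies 2*pi*k_j/n_j.

    Returns the two frequency axes and the periodogram restricted to
    k_j = 0, ..., floor((n_j - 1) / 2).
    """
    z = np.asarray(z, dtype=float)
    n1, n2 = z.shape
    n_obs = n1 * n2                      # assumption: N = n1 * n2
    dft = np.fft.fft2(z)                 # double sum of Z(s) * exp(-i s^T omega_k)
    per = np.abs(dft) ** 2 / ((2 * np.pi) ** 2 * n_obs)
    k1 = np.arange((n1 - 1) // 2 + 1)
    k2 = np.arange((n2 - 1) // 2 + 1)
    omega1 = 2 * np.pi * k1 / n1
    omega2 = 2 * np.pi * k2 / n2
    return omega1, omega2, per[np.ix_(k1, k2)]
```

The entry `[a, b]` of the returned array is the periodogram at the frequency `(omega1[a], omega2[b])`.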

Our main goal is to test whether the covariogram $C$ belongs to a certain parametric family. That is:

$$H_0: C \in \{C_\theta,\ \theta \in \Theta\} \quad \text{vs.} \quad H_a: C \notin \{C_\theta,\ \theta \in \Theta\}.$$


Equivalently, the problem can be formulated in terms of the spectral density as:

$$H_0: f \in \{f_\theta,\ \theta \in \Theta\} \quad \text{vs.} \quad H_a: f \notin \{f_\theta,\ \theta \in \Theta\},$$

which is asymptotically equivalent to

$$H_0: m \in \{m_\theta,\ \theta \in \Theta\} \quad \text{vs.} \quad H_a: m \notin \{m_\theta,\ \theta \in \Theta\}.$$

The test proposed in Paparoditis (2000) and Fan (2004) can be extended to the spatial setting. We will assume that our process can be written as a linear sequence.


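A first test statistic, based on the spatial periodogram, is given by:

$$T_1 = N |H|^{1/4} \int_{\Pi^2} \left( \frac{1}{N |H|^{1/2}} \sum_k K\!\left(H^{-1/2}(\boldsymbol{\omega} - \boldsymbol{\omega}_k)\right) \left( \frac{I(\boldsymbol{\omega}_k)}{f_\theta(\boldsymbol{\omega}_k)} \right) - 1 \right)^2 d\boldsymbol{\omega},$$

where $K$ is a bidimensional kernel and $H$ is a bandwidth matrix. The sum in the integrand extends over the Fourier frequencies. Asymptotic normality of this statistic can also be obtained.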


A different approach is based on the representation of the spatial log-periodogram as:

$$Y_k = m(\boldsymbol{\omega}_k) + z_k + r_k. \tag{13}$$

The log-likelihood function associated with (13) is:

$$\sum_k \left( Y_k - m(\boldsymbol{\omega}_k) - e^{Y_k - m(\boldsymbol{\omega}_k)} \right).$$

Parametric approach: choose $\hat{\theta}$, the maximizer of:

$$\sum_k \left( Y_k - m_\theta(\boldsymbol{\omega}_k) - e^{Y_k - m_\theta(\boldsymbol{\omega}_k)} \right).$$

Problem: this approach does not provide consistent estimates for dimensions $d \geq 2$, but this problem can be overcome by replacing the periodogram by a tapered or a smoothed version (see Guyon (1982) or Dahlhaus and Künsch (1987)).
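The parametric fit can be illustrated with a small sketch. Several assumptions are made here and none of them come from the slides: the function names are hypothetical, a plain grid search over a scalar parameter stands in for a proper numerical optimiser, and the raw log-periodogram is used, so the tapering or smoothing correction mentioned above is not included.

```python
import numpy as np


def log_periodogram_loglik(y, m):
    """Sum over k of Y_k - m(omega_k) - exp(Y_k - m(omega_k))."""
    y = np.asarray(y, dtype=float)
    m = np.asarray(m, dtype=float)
    return np.sum(y - m - np.exp(y - m))


def fit_by_grid(y, omegas, m_theta, theta_grid):
    """Return the value on theta_grid that maximises the log-likelihood.

    m_theta : callable (omegas, theta) -> array of m_theta(omega_k);
              the parametric family is supplied by the user.
    """
    values = [log_periodogram_loglik(y, m_theta(omegas, th)) for th in theta_grid]
    return theta_grid[int(np.argmax(values))]
```

As a check of the sketch, take the constant model m_theta(omega) = theta. Setting the derivative of the log-likelihood to zero gives exp(theta) equal to the average of exp(Y_k), so the grid maximiser should sit close to the logarithm of the mean periodogram value.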


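Non-parametric approach: for any $x$, approximate $m(\boldsymbol{\omega}_k)$ by a linear function and consider the local log-likelihood:

$$\sum_k \left( Y_k - a - b^T(\boldsymbol{\omega}_k - x) - e^{Y_k - a - b^T(\boldsymbol{\omega}_k - x)} \right) K_H(\boldsymbol{\omega}_k - x),$$

where $K$ is a bidimensional kernel and $H$ denotes the bandwidth matrix. The local maximum likelihood estimator $\hat{m}_{LK}(x)$ of $m(x)$ is $\hat{a}$, the first component of the maximizer $(\hat{a}, \hat{b})$.

Generalized likelihood ratio test statistic:

$$T_{LK} = \sum_k \left( e^{Y_k - m_{\hat{\theta}}(\boldsymbol{\omega}_k)} + m_{\hat{\theta}}(\boldsymbol{\omega}_k) - e^{Y_k - \hat{m}_{LK}(\boldsymbol{\omega}_k)} - \hat{m}_{LK}(\boldsymbol{\omega}_k) \right).$$

This statistic is asymptotically normally distributed. In practice, a bootstrap approach is used in order to compute the corresponding p-values.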


Testing a valid family for the covariogram

Consider $\{Z_l(s),\ s \in D_l\}$, with $l = 1, \ldots, L$, realizations of a spatial stochastic process (for instance, realizations taken at $L$ time moments) or $L$ realizations of different spatial processes.

The purpose in Crujeiras et al. (2006) and Crujeiras et al. (2007) is to test whether the dependence structure of $Z_l$, $l = 1, \ldots, L$, is the same. In terms of the log-spectral densities $m_l$, the testing problem can be written as $H_0: m_1 = \ldots = m_L$ vs. $H_a: m_l \neq m_j$ for some $l \neq j$.


In this context, the comparison can be made by considering nonparametric estimators of the spatial log-spectral densities. Consider the following test statistic, based on an $L_2$-distance:

$$Q = \sum_{l=2}^{L} \sum_{j=1}^{l-1} \left( \int_{\Pi^2} \left( \hat{m}_l(\boldsymbol{\omega}) - \hat{m}_j(\boldsymbol{\omega}) \right)^2 \omega(\boldsymbol{\omega})\, d\boldsymbol{\omega} \right),$$

where $\omega(\cdot)$ is a positive, bounded weight function with support $\Pi^2$. This weight function is usually chosen to avoid edge effects. In the spectral context, this function can be chosen in order to filter out frequencies where the periodogram presents higher variability, such as the origin or those frequencies with $\pi$-valued components.


For the particular case of comparing two spatial processes, the testing problem is formulated as:

$$H_0: m_1(\cdot) = m_2(\cdot) \quad \text{vs.} \quad H_a: m_1(\cdot) \neq m_2(\cdot).$$

We assume that both $Z_1$ and $Z_2$ have been observed on grids with the same design. This implies that the corresponding Fourier frequencies are the same in both cases. By a Riemann approximation, the test statistic $Q$ can be approximated by:

$$Q = \frac{(2\pi)^2}{N} \sum_k \left( \hat{m}_1(\boldsymbol{\omega}_k) - \hat{m}_2(\boldsymbol{\omega}_k) \right)^2 \omega(\boldsymbol{\omega}_k). \tag{1}$$
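The Riemann-sum version of Q in (1) is a weighted sum of squared differences and can be sketched in a few lines. The sketch assumes that the two estimated log-spectral densities and the weight function have already been evaluated at the same Fourier frequencies and are passed as arrays of equal shape. The function name `q_statistic` and the argument `n_obs` (standing for N) are hypothetical.

```python
import numpy as np


def q_statistic(m1_hat, m2_hat, weight, n_obs):
    """(2*pi)^2 / N times the weighted sum of squared differences in (1).

    m1_hat, m2_hat : estimated log-spectral densities at the Fourier frequencies
    weight         : weight function evaluated at the same frequencies
    n_obs          : the constant N appearing in (1), supplied by the caller
    """
    m1_hat = np.asarray(m1_hat, dtype=float)
    m2_hat = np.asarray(m2_hat, dtype=float)
    weight = np.asarray(weight, dtype=float)
    return (2 * np.pi) ** 2 / n_obs * np.sum((m1_hat - m2_hat) ** 2 * weight)
```

Setting entries of `weight` to zero is one way to exclude frequencies such as the origin from the comparison.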


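Then, under the null hypothesis $H_0: m_1 = m_2$, we have that, as $N \to \infty$:

$$\sqrt{N^2 |H|^{1/2}} \left( Q - \frac{(2\pi)^4}{12\, N |H|^{1/2}}\, C_K I_\omega \right) \to N(0, \sigma_Q^2), \tag{2}$$

in distribution, with

$$C_K = \int K^2(u)\, du, \qquad I_\omega = \int \omega(\boldsymbol{\omega})\, d\boldsymbol{\omega},$$

and the asymptotic variance is

$$\sigma_Q^2 = \frac{(2\pi)^8}{72} \int (K * K)^2(u)\, du \int \omega^2(\boldsymbol{\omega})\, d\boldsymbol{\omega},$$

where $*$ denotes the convolution operator. Since the convergence to the Gaussian limit is slow, the authors propose a bootstrap procedure for p-value calibration (see Crujeiras et al. (2007)).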


Thank you for your attention!!!

Wenceslao González Manteiga
Rosa M. Crujeiras Casais
