Biometrika (xxxx), xx, x, pp. 1–14. Advance Access publication on xx xxxx. © 2012 Biometrika Trust

Printed in Great Britain

Dimension reduction based on the Hellinger integral

BY QIN WANG

Department of Statistical Sciences and Operations Research, PO Box 843083, Virginia Commonwealth University, Richmond, Virginia 23284, U.S.A.

[email protected]

XIANGRONG YIN

Department of Statistics, University of Georgia, 101 Cedar Street, Athens, Georgia 30602, U.S.A.

[email protected]

AND FRANK CRITCHLEY

Department of Mathematics and Statistics, The Open University, Milton Keynes MK7 6AA, U.K.

[email protected]

SUMMARY

Sufficient dimension reduction is a useful tool to study the dependence between a response and a multidimensional predictor. Here, a new formulation is proposed based on the Hellinger integral of order two, introduced as a natural measure of the regression information contained in a predictor subspace. The response may be either continuous or discrete. Links between local and global central subspaces, of some independent interest, are established. Exploiting these, a local estimation algorithm is developed and tested on both simulated and real data. Relative to sliced regression, widely regarded as among the best existing methods, its overall performance is broadly comparable, sometimes better. Computationally, it is much more efficient, allowing larger problems to be tackled.

Some key words: Central subspace; Hellinger integral; Local central subspace; Sufficient dimension reduction.

1. INTRODUCTION

Sufficient dimension reduction seeks a low-dimensional subspace of a p × 1 predictor vector X without loss of information about the regression of a response Y on X, and without pre-specifying a model for any of Y | X, X | Y or (X, Y). Such a reduced predictor space is called a dimension reduction subspace. We assume throughout that the intersection of all such spaces is itself a dimension reduction subspace, as holds under very mild conditions (Cook, 1998a; Yin et al., 2008). This intersection, called the central subspace S_{Y|X} (Cook, 1994, 1996), becomes the natural focus of inferential interest. Its dimension d_{Y|X} ∈ {0, ..., p} is called the structural dimension of the regression.

Dimension reduction methods typically seek a small number of linear combinations of X to optimize a suitable target function phrased in terms of moments, as in Li (1991) and Xia et al. (2002), or density or characteristic functions, as in Hernández & Velilla (2005) and Zhu & Zeng (2006).


Since the first method, sliced inverse regression (Li, 1991), was introduced, sufficient dimension reduction has been an active research field: for recent reviews, see Yin (2010) and Ma & Zhu (2013). Its methods can be broadly categorized into three groups, according to the random vector focused upon. Other inverse regression methods focusing on X | Y include sliced average variance estimation (Cook & Weisberg, 1991), principal Hessian directions (Li, 1992; Cook, 1998b), kth moment estimation (Yin & Cook, 2002) and contour regression (Li et al., 2005). While computationally inexpensive, such methods require either or both of the key linearity and constant covariance conditions (Cook, 1998a). An exhaustiveness condition may also be required to ensure recovery of the whole central subspace. Average derivative estimation (Härdle & Stoker, 1989; Samarov, 1993), minimum average variance estimation (Xia et al., 2002) and sliced regression (Wang & Xia, 2008) are examples of forward regression methods, focusing on Y | X. While free of strong assumptions, the computational burden of these methods increases dramatically with either sample size or the number of predictors, due to the use of nonparametric estimation. Including those based on Kullback–Leibler divergence (Yin & Cook, 2005; Yin et al., 2008), Fourier analysis (Zhu & Zeng, 2006) and integral transforms (Zeng & Zhu, 2010), joint methods focusing on (X, Y) are intermediate.

This paper describes a new density-based joint approach targeting the central subspace by exploiting a novel characterization of dimension reduction subspaces in terms of the Hellinger integral; more fully, the Hellinger integral of order two. The main assumptions involved are very mild: existence of S_{Y|X} and a finiteness condition on the Hellinger integral. Accordingly, this approach is more flexible than many others. In particular the response, here taken as univariate, may be continuous or discrete. Estimation may be done either globally or locally. Exploiting links between global and local central subspaces, established under a mild regularity condition, we focus on a local estimation algorithm based on a random sample {(x_i, y_i), i = 1, ..., n} from the joint distribution F_{(X,Y)}. Relative to sliced regression, widely regarded as among the best existing methods, its overall performance is found to be broadly comparable, sometimes better. Computationally, it is much more efficient, allowing larger problems to be tackled. Matlab code is available upon request.

Straightforward proofs are omitted; more detailed ones are in the Appendix. For brevity, the same notation is used for continuous and discrete Y. We write as if the former were the case, while leaving all differential terms implicit. For example, we write ∫ p(x, y) = 1 in both cases, integration/summation being over the support of the integrand unless otherwise indicated. The notation W_1 ⊥⊥ W_2 | W_3 means that the random vectors W_1 and W_2 are independent given any value of W_3. For any vector x, its Euclidean orthogonal projection onto a subspace S is denoted x_S. The direct sum of two subspaces {x_1 + x_2 : x_1 ∈ S_1, x_2 ∈ S_2} is denoted S_1 ⊕ S_2. For any positive definite symmetric matrix P, its unique square root enjoying the same properties is denoted P^{1/2}, with inverse P^{−1/2}.

2. THE HELLINGER INTEGRAL OF ORDER TWO

2·1. Definition

Throughout, X_u denotes the reduced predictor vector u^T X, taking observed values x_u = u^T x, where the non-random matrix u is either the zero vector 0_p or has p rows and full column rank d. We study the Hellinger integral of order two (Vajda, 1989) defined here for each such u by

H(u) = E{R(X_u; Y)},


where R(x_u; y) is the dependence ratio

R(x_u; y) = p(x_u, y) / {p(x_u) p(y)} = p(y | x_u) / p(y) = p(x_u | y) / p(x_u).

The expectation is over the joint distribution, as can be emphasized by writing H(u) more fully as H(u; F_{(X,Y)}), and is assumed finite for all such u. We know of no departures from this finiteness condition which are likely to occur in statistical practice, if only because of errors of observation. For example, if (X, Y) is bivariate normal with correlation ρ, H(1) = (1 − ρ²)^{−1} becomes infinite in either singular limit ρ → ±1 but, then, Y is a deterministic function of X. We assume that H(u) varies continuously with u for each fixed d > 0, as holds for suitably smooth p(x, y).
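As a quick check of this example (our own calculation, not part of the original text), write out the order-two Hellinger integral for the standard bivariate normal density with correlation ρ:

H(1) = \int \frac{p(x,y)^2}{p(x)\,p(y)}\,dx\,dy
     = \frac{1}{2\pi(1-\rho^2)} \int \exp\left\{ -\frac{(1+\rho^2)(x^2+y^2) - 4\rho x y}{2(1-\rho^2)} \right\} dx\,dy
     = \frac{1}{1-\rho^2},

the final step following because the 2 × 2 matrix defining the quadratic form in the exponent has unit determinant, so the Gaussian integral equals 2π.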

2·2. Properties

A variety of properties indicate that R and H are well-adapted to our purposes. Both can be viewed in terms of mutual dependence of Y and X_u, or of forwards or inverse regression, the relation H(u) = ∫ p(y | x_u) p(x_u | y) showing that H integrates information from the latter two.

Again, key properties of the central subspace (Cook, 1998a) are mirrored locally in R and, hence, globally in H. Its invariance S_{Y*|X} = S_{Y|X} under one-to-one transformation Y → Y* of the response is reflected in R(x_u; y*) = R(x_u; y) and H(u; F_{(X,Y)}) = H(u; F_{(X,Y*)}). Equally, the relation S_{Y|X*} = A^{−1} S_{Y|X} under nonsingular affine transformation X → X* = A^T X + b of the predictors is reflected in R(x_u; y) = R((A^{−1}u)^T x*; y) and H(u; F_{(X,Y)}) = H(A^{−1}u; F_{(X*,Y)}). The same affine equivariance property holding for our estimates of S_{Y|X}, there is no loss in taking the predictors to have identity covariance throughout.

Further, R(x_u; y) and, hence, H(u) depend on u only via the subspace spanned by its columns, the same therefore holding for the χ² measure of divergence from independence:

χ²(u) = ∫ [{p(x_u, y) − p(x_u) p(y)}² / {p(x_u) p(y)}] = E[{R(X_u; Y) − 1}² / R(X_u; Y)].

This accords with the fact that primary interest lies in the functions of a general subspace S of ℝ^p which they therefore induce: ℋ(S) = H(u) and 𝒳²(S) = χ²(u), where S = Span(u). In terms of optimizing ℋ over subspaces of fixed dimension d ∈ {0, ..., p}, there is no loss in restricting attention to orthonormal bases for them: that is, to the relevant Stiefel manifold, denoted here U_d and identified with {0_p} (d = 0) or {all p × d matrices u with u^T u = I_d} (d > 0).

Finally, there are clear links with departures from independence. Globally, Y ⊥⊥ X_u if and only if R(x_u; y) = 1 for every supported (x_u, y), departures from unity at a particular point indicating local dependence there. Following Vajda (1989, p. 229 & 246), E[{R(X_u; Y)}^{−1}] = 1 gives

H(u) − 1 = χ²(u) ≥ 0,    (1)

equality holding if and only if Y ⊥⊥ X_u.

Of central importance here, these links extend naturally to the conditional case. As we may, with S_i = Span(u_i) (i = 1, 2) and

R(x_{u_2}; y | x_{u_1}) = p(x_{u_2}, y | x_{u_1}) / {p(x_{u_2} | x_{u_1}) p(y | x_{u_1})},

we put ℋ(S_2 | S_1) = H(u_2 | u_1) = E_{X_{u_2} | (X_{u_1}, Y)} {R(X_{u_2}; Y | X_{u_1})} and

𝒳²(S_2 | S_1) = χ²(u_2 | u_1) = E_{(X,Y)} ( R(X_{u_1}; Y) [ {R(X_{u_2}; Y | X_{u_1}) − 1}² / R(X_{u_2}; Y | X_{u_1}) ] ).


Since H(0_p) = 1 while χ²(u | 0_p) = χ²(u), together with the remark following it, (1) is the special case S_1 = {0_p} and S_2 = Span(u) of the following general result.

THEOREM 1. Let S_1 and S_2 be subspaces of ℝ^p meeting only at the origin. Then:

ℋ(S_1 ⊕ S_2) − ℋ(S_1) = 𝒳²(S_2 | S_1) ≥ 0,

equality holding if and only if Y ⊥⊥ X_{S_2} | X_{S_1}.

Theorem 1 establishes ℋ(S) as a natural measure of the amount of information about the regression of Y on X contained in a subspace S, being strictly increasing with S except only when, conditionally on the dependence information already contained, additional dimensions carry no further information. This property of the Hellinger integral establishes its link with dimension reduction subspaces, as we now discuss.

2·3. Links with dimension reduction subspaces

The following result uses the Hellinger integral to characterize dimension reduction subspaces and, thereby, the central subspace S_{Y|X} = Span(η), where η ∈ U_{d_{Y|X}}. In it, for each d = 0, ..., p, H_d denotes the maximum of H(u) over U_d, this maximum existing since, by hypothesis, H is continuous on the compact set U_d.

THEOREM 2. We have:

1. ℋ(S) ≤ ℋ(ℝ^p) for every subspace S of ℝ^p, equality holding if and only if S is a dimension reduction subspace (that is, if and only if S ⊇ S_{Y|X}).

2. All dimension reduction subspaces contain the same, full, regression information H(I_p) = H(η), the central subspace being the smallest-dimension subspace with this property.

3. S_{Y|X} uniquely maximizes ℋ(·) over all subspaces of dimension d_{Y|X}.

4. H_d is strictly increasing with subspace dimension d over {0, ..., d_{Y|X}} and constant thereafter.

Theorem 2 has useful implications for exhaustive estimation of the central subspace. Part 3's characterization establishes that maximizing H(·) over U_{d_{Y|X}} recovers a basis for it. In the usual case where d_{Y|X} is unknown, part 4 motivates seeking an H-optimal basis η^{(d)} for increasing d until d = d_{Y|X} can be inferred. A permutation test procedure analogous to that in Section 4·4 below can be used for this purpose, H_{d+1} − H_d being a suitable test statistic for testing H_0: d_{Y|X} = d against H_a: d_{Y|X} > d. A practical procedure involves, then, replacing H(u) by the sample average of density-estimated R(x_u; y) values at each data point, maximizing the result over successive Stiefel manifolds U_d via an appropriate algorithm. At the same time, the associated computational cost suggests the merit of developing a complementary methodology based on fast, local computation. For brevity, attention is now restricted to this alternative approach, which applies the above results to local central subspaces, introduced next.
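To fix ideas, the following is a minimal sketch of the global procedure just described, for the one-dimensional case with a continuous response. It is our illustration only, not the paper's algorithm: the kernel density estimator and the Nelder-Mead optimizer are arbitrary stand-ins for "an appropriate algorithm", and the function names are ours.

import numpy as np
from scipy.stats import gaussian_kde
from scipy.optimize import minimize

def hellinger_objective(v, X, y):
    """Sample average of density-estimated R(x_u; y) values for u = v/||v||."""
    u = v / np.linalg.norm(v)
    xu = X @ u
    joint = gaussian_kde(np.vstack([xu, y]))
    marg_x, marg_y = gaussian_kde(xu), gaussian_kde(y)
    R = joint(np.vstack([xu, y])) / (marg_x(xu) * marg_y(y))
    return R.mean()

def global_direction(X, y, n_starts=5, rng=None):
    """Maximise the estimated H(u) over u in U_1 from several random starts."""
    rng = np.random.default_rng(rng)
    best_u, best_val = None, -np.inf
    for _ in range(n_starts):
        v0 = rng.standard_normal(X.shape[1])
        res = minimize(lambda v: -hellinger_objective(v, X, y), v0, method='Nelder-Mead')
        if -res.fun > best_val:
            best_val, best_u = -res.fun, res.x / np.linalg.norm(res.x)
    return best_u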

3. LINKS BETWEEN GLOBAL AND LOCAL CENTRAL SUBSPACES

Local central subspaces are central subspaces S_{Y_L|X_L} for (X_L, Y_L), the random vector arising when (X, Y) is localised by conditioning upon (X, Y) ∈ L for an appropriate subset L of its support Ω. When L = Ω, we recover the global space S_{Y|X}.

These local spaces exist, and have a natural characterisation, for any L ⊆ Ω that satisfies a technical regularity condition, seen to be mild. The X and Y projections of L are denoted L_X and L_Y, with p_L(y | x) the conditional density of Y_L | X_L = x and L_{X|y} = {x : (x, y) ∈ L}.


We call L ⊆ Ω with ∫_L p(x, y) > 0 regular if: (a) L_X and, for each y ∈ L_Y, L_{X|y} are open and convex; and (b) p_L(y | x) is differentiable with respect to x at each (x, y) ∈ L.

Regularity is, indeed, a mild condition, (b) being a minimal smoothness requirement, while (a) holds if L is either (a1) open and convex, such as the interior of an ellipsoid, or (a2) a Cartesian product L_X × L_Y with L_X open and convex, L_Y being unconstrained. The neighbourhoods used in the sequel, according as Y is continuous or discrete, satisfy conditions (a1) and (a2) respectively: see Section 4·2. When L_Y = Ω_Y in the latter case, we call L an X-only localization.

Regularity allows appropriate use of the fundamental theorem of integral calculus in the proof of the following lemma. Omitted for brevity, it suitably amends the proof of a corresponding result for central mean subspaces in Zhu & Zeng (2006). Putting L = Ω and p_L(y | x) = p(y | x), the lemma here specializes at once to S_{Y|X}. As is intuitive, we have

LEMMA 1. For any regular L, S_{Y_L|X_L} = Span{∂p_L(y | x)/∂x : (x, y) ∈ L}.

This characterisation provides a generic link between S_{Y|X} and its local central subspaces.

THEOREM 3. For any regular L and Ω, S_{Y_L|X_L} ⊆ S_{Y|X}.

There being no intrinsic reason why localization should affect structural dimension, Theorem 3 typically holds with equality as, trivial cases apart, it must for any single-index model.

Again, for any collection ℒ of regular sets L, denoting by S_ℒ the induced direct sum of local central subspaces ⊕_{L∈ℒ} S_{Y_L|X_L}, Theorem 3 gives at once

S_ℒ ⊆ S_{Y|X}.    (2)

When ∪_{L∈ℒ} L = Ω, we call ℒ a regular covering of Ω, equality then being expected in (2) in that it is only the possibility of Simpson's paradox which prevents this automatic conclusion. Indeed,

COROLLARY 1. For any regular covering of a regular Ω by X-only localisations, S_ℒ = S_{Y|X}.

4. ESTIMATION

4·1. Overview

Population results from Sections 2 and 3 are applied here in the sample, our focus being a fast, local algorithm to estimate S_{Y|X}. Whereas establishing general conditions for exhaustiveness remains an important and challenging problem for future study, as does developing the associated asymptotics, the underlying ideas are intuitive and clear. There are points of contact with average derivative estimation (Härdle & Stoker, 1989; Samarov, 1993) and outer product of gradients estimation (Xia et al., 2002).

In outline, our estimation procedure is as follows. By Theorem 2, for every regular L ⊆ Ω, maximizing the local Hellinger integral H_L(u) = H(u; F_{(X_L,Y_L)}) over u ∈ U_{d_L}, d_L = d_{Y_L|X_L}, will provide a basis for S_{Y_L|X_L}. Whereas d_L = d_{Y|X} is typically expected, for computational and statistical efficiency, we estimate a single, Hellinger-dominant, direction locally to each sample point. That is, we focus on n one-dimensional local optimizations, one for each member of a regular covering ℒ_n = {L_i : i = 1, ..., n} of the empirical support. Each is very fast, requiring only calculation of the dominant eigenvector û_i maximizing a suitable approximation to H_{L_i}(u) over u ∈ U_1. Pooling these to form M̂ = n^{−1} Σ_{i=1}^n û_i û_i^T, a permutation test procedure is used to find d̂_M estimating the dimension d_M of the span of the corresponding population kernel matrix M, exhaustive recovery corresponding to d_M = d_{Y|X}. Finally, we estimate S_{Y|X} as the span of the d̂_M dominant eigenvectors of M̂, this being equivariant under affine predictor transformation.


This core algorithm, detailed below, performs well: see Section 5. In particular, exhaustiveness, which arises precisely when the Hellinger-dominant directions span the central subspace, is typically achieved. The intuition for this is that a regular covering usually gives equality in (2), as discussed above, while a transparent example illustrates each direction in the global central subspace being Hellinger-dominant locally to some supported point. Let u ∈ U_1 and X_(1), X_(2) and ε be independently standard normal with Y = X_(1)² + X_(2)² + ε², so that d_{Y|X} = 2; that is, S_{Y|X} contains all directions in ℝ². Then, locally to any (x, y), H_L(u) behaves like R(x_u; y) which, in turn, is easily seen to be maximized over u ∈ U_1 by ±x/‖x‖, as is intuitive.

Again, the joint localization used for continuous responses can help filter out effects of their possible extreme values, improving estimation accuracy: Model I in Section 5·1 illustrates this. Variations and extensions of the core algorithm are briefly indicated, below and in Section 6.

4·2. Additional details

Under the mild smoothness requirement (b) of Section 3, for given (x_0, y_0) and r > 0, we define the regular subset L_0(r) as follows. For continuous responses, L_0(r) is the open Euclidean ball centred on (x_0, y_0) with radius r. For discrete responses, smoothing being meaningful only on the predictors, L_0(r) is the Cartesian product of the open Euclidean ball centred on x_0 with radius r and the singleton {y_0}. In both cases, as r → 0,

H_{L_0(r)}(u) ∼ R(x_{u,0}; y_0)    (3)

where x_{u,0} = u^T x_0. For each i = 1, ..., n, L_i is the special case of L_0(r) with (x_0, y_0) = (x_i, y_i), r = r_i(k) being chosen so that the open ball involved contains exactly k sample points additional to its centre, their index set being denoted by N_i. This number of nearest neighbours k plays the role of a tuning parameter. A single direction in ℝ^p being estimated from each L_i, we use the rule-of-thumb choice k/p = 4, halving this value for continuous responses taking frequent extreme values; this latter arises here solely with Model I in Section 5·1. More refined choices of k, such as cross-validation, are possible at greater computational expense.

The population kernel matrix is M = E_{(X,Y)}{u(X, Y) u(X, Y)^T}, where u(x, y) is the Hellinger-dominant direction at (x, y). If R(x_u; y) = 1 for each u ∈ U_1, we formally put u(x, y) = 0_p. Otherwise, u(x, y) = ± argmax{R(x_u; y) : u ∈ U_1}, assumed unique. Thus, exhaustiveness corresponds to Span(M) = S_{Y|X}. In practice, we estimate M by M̂ = n^{−1} Σ_{i=1}^n û_i û_i^T, û_i being as follows.

The local approximation R_i(u) = R(u^T x_i; y_i) to H_{L_i}(u) given by (3) only requires density estimation at a single point. Recalling that R(u^T x; y) is invariant to translation of x or y, this may be taken as the relevant origin. The estimate R̂_i(u), u ∈ U_1, is detailed below. It meets two criteria: (i) for given (x_i, y_i), it depends only on the observations indexed by N_i, these being regarded as a sample of size k from a localised conditional distribution. And: (ii) it is readily maximized over U_1. Specifically, R̂_i(u) depends on u only via u^T F_i u / u^T G_i u, being a strictly increasing function of this ratio. The matrices F_i and G_i are as follows. When Y is continuous,

F_i = k^{−1} Σ_{j∈N_i} (y_j − y_i)² (x_j − x_i)(x_j − x_i)^T,   G_i = k^{−1} Σ_{j∈N_i} (x_j − x_i)(x_j − x_i)^T.    (4)

When Y is discrete, with N′_i denoting {j ∈ N_i : y_j = y_i} of size k′_i, independent of u,

F_i = k^{−1} Σ_{j∈N_i} (x_j − x_i)(x_j − x_i)^T,   G_i = (k′_i)^{−1} Σ_{j∈N′_i} (x_j − x_i)(x_j − x_i)^T,    (5)

where k′_i ≥ p is required for nonsingularity of G_i. If G_i is singular or, suggesting d_{L_i} = 0, N′_i = N_i, we discard the corresponding case from the analysis, formally putting û_i = 0_p. In all other cases, the optimal û_i is readily computed as the dominant unit eigenvector of G_i^{−1} F_i.
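As a concrete illustration of these formulae, the following is a minimal Python sketch of the local step, the pooled kernel matrix and the final back-transformation; it is ours, not the authors' Matlab implementation, and the function names local_direction, kernel_matrix and estimate_subspace, the brute-force nearest-neighbour search and the numpy eigen-solver are all our own simplifying choices.

import numpy as np

def local_direction(X, y, i, k, discrete=False):
    """Hellinger-dominant direction at sample point i: the dominant unit
    eigenvector of G_i^{-1} F_i, with F_i and G_i as in (4) or (5);
    the zero vector is returned for degenerate neighbourhoods."""
    n, p = X.shape
    if discrete:
        dist = np.linalg.norm(X - X[i], axis=1)        # x-only localisation
    else:
        Z = np.column_stack([X, y])                    # joint (x, y) localisation
        dist = np.linalg.norm(Z - Z[i], axis=1)
    nbrs = np.argsort(dist)[1:k + 1]                   # k nearest neighbours, centre excluded
    dX = X[nbrs] - X[i]
    if discrete:
        same = nbrs[y[nbrs] == y[i]]                   # the set N'_i
        if len(same) < p or len(same) == k:            # singular G_i, or N'_i = N_i
            return np.zeros(p)
        F = dX.T @ dX / k
        dXs = X[same] - X[i]
        G = dXs.T @ dXs / len(same)
    else:
        w = (y[nbrs] - y[i]) ** 2
        F = (dX * w[:, None]).T @ dX / k
        G = dX.T @ dX / k
    try:
        vals, vecs = np.linalg.eig(np.linalg.solve(G, F))
    except np.linalg.LinAlgError:
        return np.zeros(p)
    u = np.real(vecs[:, np.argmax(np.real(vals))])
    return u / np.linalg.norm(u)

def kernel_matrix(X, y, k=None, discrete=False):
    """M-hat = n^{-1} sum_i u_i u_i^T, for predictors already standardized
    to identity covariance."""
    n, p = X.shape
    k = k if k is not None else 4 * p                  # rule-of-thumb k/p = 4
    U = np.array([local_direction(X, y, i, k, discrete) for i in range(n)])
    return U.T @ U / n

def estimate_subspace(X_obs, y, d, k=None, discrete=False):
    """Standardize, pool local directions and back-transform: returns a basis
    of the estimated central subspace on the original predictor scale."""
    w, Q = np.linalg.eigh(np.cov(X_obs, rowvar=False))
    V_inv_sqrt = Q @ np.diag(w ** -0.5) @ Q.T          # symmetric V^{-1/2}
    X = (X_obs - X_obs.mean(0)) @ V_inv_sqrt
    M = kernel_matrix(X, y, k, discrete)
    vals, vecs = np.linalg.eigh(M)
    return V_inv_sqrt @ vecs[:, np.argsort(vals)[::-1][:d]]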


It follows directly that the resulting estimate of S_{Y|X} has the affine equivariance property claimed in Section 2·2. Recall that, denoting the observed predictor vectors by {x_i^obs}_{i=1}^n with empirical covariance V, estimation is based on their standardized forms x_i = V^{−1/2} x_i^obs, a final back-transformation providing the more interpretable Ŝ_{Y|X^obs} = V^{−1/2} Ŝ_{Y|X}. We have

PROPOSITION 1. For any nonsingular A, x_i^obs → x_{i*}^obs = A^T x_i^obs + b (i = 1, ..., n) induces Ŝ_{Y|X^obs} → Ŝ_{Y|X_*^obs} = A^{−1} Ŝ_{Y|X^obs}.

4·3. Estimation of R̂_i(u)

The estimate R̂_i(u), u ∈ U_1, meeting criteria (i) and (ii) above combines a number of more or less standard kernel smoothing results. We refer the reader to Wand & Jones (1995) for further details of these. Here, K denotes a symmetric univariate kernel density with variance σ = ∫ t² K(t) dt > 0 and scaled version K_h(t) = h^{−1} K(t/h), h > 0. Specifically, we take K as the uniform density on |t| ≤ 1. Throughout, Z = X_u − x_{u,i} and W = Y − y_i, where x_{u,i} = u^T x_i.

Supposing first that Y is continuous, let Z and W have joint density p_{(Z,W)}, with margins p_Z and p_W. Under sufficient smoothness, Taylor expansion of p_Z around 0 gives, for small h_z > 0,

E{g(Z; h_z)} = p_Z(0) + O(h_z²),   g(Z; h_z) = (σ h_z²)^{−1} Z² K_{h_z}(Z),

the first order term vanishing by symmetry of K. Accordingly, meeting the first criterion, we use

p̂_Z(0) = k^{−1} Σ_{j∈N_i} g(z_j; h_z)   and   p̂_W(0) = k^{−1} Σ_{j∈N_i} g(w_j; h_w),    (6)

with h_z = max_{j∈N_i} |z_j| and h_w = max_{j∈N_i} |w_j|, in which z_j = x_{u,j} − x_{u,i} and w_j = y_j − y_i. An exactly similar bivariate analysis using the scaled product kernel K_{h_z}(Z) K_{h_w}(W), h_z and h_w having the same order, leads to p̂_{(Z,W)}(0, 0) = k^{−1} Σ_{j∈N_i} g(z_j; h_z) g(w_j; h_w). In this way, the multiplicative bandwidth terms cancelling, the second criterion is also met by

R̂_i(u) = p̂_{(Z,W)}(0, 0) / {p̂_Z(0) p̂_W(0)} = c u^T F_i u / u^T G_i u,

where c = {k^{−1} Σ_{j∈N_i} (y_j − y_i)²}^{−1} > 0 is independent of u, F_i and G_i being as in (4); the bandwidth terms cancel because, with the uniform kernel, each g(z_j; h_z) = z_j²/(2σ h_z³) and each g(w_j; h_w) = w_j²/(2σ h_w³), so the factor (4σ² h_z³ h_w³)^{−1} is common to numerator and denominator.

Supposing now that Y is discrete, alongside p̂_Z(0) given by (6), we use p̂_{Z|W=0}(0) = (k′_i)^{−1} Σ_{j∈N′_i} g(z_j; h_{z|w}) with h_{z|w} = max_{j∈N′_i} |z_j|. In this case, with F_i and G_i as in (5),

R̂_{i,0}(u) = p̂_{Z|W=0}(0) / p̂_Z(0) = (h_{z|w}/h_z)^{−3} (u^T F_i u / u^T G_i u)^{−1}

depends on a bandwidth ratio which, not varying smoothly with u, inhibits the desired maximization. However, denoting standard kth nearest neighbour estimates with a tilde, R̃_i(u) = p̃_{Z|W=0}(0) / p̃_Z(0) = (k′_i/k) / (h_{z|w}/h_z). Combining these two estimates,

R̂_i(u) = [{R̃_i(u)}³ / R̂_{i,0}(u)]^{1/2} = c′ (u^T F_i u / u^T G_i u)^{1/2}

meets both criteria, c′ = (k′_i/k)^{3/2} > 0 being independent of u.

4·4. A permutation test procedure to determine dimensionality

The kernel dimension reduction matrix M spans a subspace of S_{Y|X} whose dimension d_M we estimate using a permutation test procedure.


This follows Cook & Yin (2001) and Yin & Cook (2002), where further details may be found. It makes the mild assumption that d_M < p. In particular, this holds whenever dimension reduction is possible, that is, whenever d_{Y|X} < p.

Pooling the û_i to form M̂ = n^{−1} Σ_{i=1}^n û_i û_i^T, dimensions signalled across a range of local subspaces are reinforced, while others are suppressed. Thus, we expect the first d_M eigenvalues of M̂ to be significantly larger than the rest, these latter being close to zero and so to each other. We denote the ordered eigenvalues of M̂ by λ_1 > ... > λ_p, with corresponding orthonormal eigenvectors γ_1, ..., γ_p.

Beginning with m = 0, we test H_0: d_M = m against H_a: d_M > m, using as test statistic the observed value

f_obs = λ_{m+1} − (p − m − 1)^{−1} Σ_{i=m+2}^p λ_i,    (m < p − 1)

small values relative to the permutation distribution described below indicating acceptance, large values rejection. If H_0 is accepted, we take d̂_M = m. Else, we increase m by 1 and repeat the procedure, as required. If H_0: d_M = p − 2 is rejected, we take d̂_M = p − 1.

Let C_m = (γ_{m+1}, ..., γ_p) and D_m = (γ_1, ..., γ_m), D_0 not being used. Sampling variability apart, if d_M ≤ m, the predictor vectors C_m^T x_i carry no locally detectable regression information. In this case, randomly permuting them via C_m^T x_i ← C_m^T x_{π(i)} while holding y_i and D_m^T x_i fixed, and recomputing M̂ as M̂_π, will result in new test statistics f_π not systematically different from f_obs, f_π tending to decrease under H_a. Applying J independent random permutations in this way to obtain f_π(j), j = 1, ..., J, we compute the permutation p-value

p_perm = J^{−1} Σ_{j=1}^J I{f_π(j) > f_obs},

rejecting H_0 if p_perm < α for some pre-specified significance level α.
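A minimal sketch of this sequential test, reusing the hypothetical kernel_matrix helper introduced after Section 4·2 (again ours, not the authors' code, and assuming the predictors have already been standardized), might read:

import numpy as np

def stat(eigvals, m):
    # f_obs = lambda_{m+1} minus the mean of lambda_{m+2}, ..., lambda_p
    return eigvals[m] - eigvals[m + 1:].mean()

def permutation_dimension(X, y, J=1000, alpha=0.05, k=None, discrete=False, rng=None):
    """Estimate d_M by the sequential permutation test of Section 4.4."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    M = kernel_matrix(X, y, k, discrete)
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    for m in range(p - 1):
        f_obs = stat(vals, m)
        D, C = vecs[:, :m], vecs[:, m:]
        exceed = 0
        for _ in range(J):
            perm = rng.permutation(n)
            # permute the C_m-part of the predictors, holding y and the D_m-part fixed
            X_perm = X @ D @ D.T + (X @ C)[perm] @ C.T
            ev = np.sort(np.linalg.eigvalsh(kernel_matrix(X_perm, y, k, discrete)))[::-1]
            exceed += stat(ev, m) > f_obs
        if exceed / J >= alpha:          # H_0: d_M = m is not rejected
            return m
    return p - 1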

5. EVALUATION

5·1. Simulation studies

Here, we evaluate the performance of the core algorithm by simulation. A variety of existing methods are used for comparison: sliced inverse regression, sliced average variance estimation, principal Hessian directions, minimum average variance estimation and sliced regression. This last method is reported (Wang & Xia, 2008) to have superior accuracy to a wide variety of other methods.

We consider four varied models. Model I, studied by Wang & Xia (2008), has frequent extreme responses. Model II, used by Zhu & Zeng (2006), has a discrete response Y ∈ {0, 1, 2, 3}. Model III, studied in Xia (2007), has central subspace directions in both the regression mean and variance functions. Model IV reflects a situation where only partial response information is available, the continuous response of Model III being replaced by a ternary variable indicating which of three ranges it falls in. Specifically, the models are as follows:

Model I:   Y = (X^T β)^{−1} + 0.5 ε,
Model II:  Y = I[|X^T β_1 + 0.2 ε| < 1] + 2 I[X^T β_2 + 0.2 ε > 0],
Model III: Y = 2(X^T β_1) + 2 exp(X^T β_2) ε,
Model IV:  Y = 0, 1 or 2 according as Y_0 ≤ −2, −2 < Y_0 < 2 or Y_0 ≥ 2, where Y_0 = 2(X^T β_1) + 2 exp(X^T β_2) ε.

In all four models, X is a 10-dimensional predictor, independent of a standard Gaussian noise variable ε. In Models I and II, X ∼ N_10(0, Σ), with Σ = (σ_ij) = (0.5^{|i−j|}). In Models III and IV, the components of X are independent uniform variates on (−√3, √3). In Model I, β = (1, 1, 1, 1, 0, ..., 0)^T. In Model II, β_1 = (1, 1, 1, 1, 0, ..., 0)^T and β_2 = (0, ..., 0, 1, 1, 1, 1)^T. In Models III and IV, β_1 = (1, 2, 0, ..., 0, 2)^T/3 and β_2 = (0, 0, 3, 4, 0, ..., 0)^T/5.
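For instance, data from Model III can be generated as follows (our sketch of the stated design; the function name simulate_model_iii is ours):

import numpy as np

def simulate_model_iii(n, rng=None):
    """Model III: Y = 2 X'b1 + 2 exp(X'b2) eps, with independent uniform
    predictors on (-sqrt(3), sqrt(3)) and standard Gaussian noise."""
    rng = np.random.default_rng(rng)
    b1 = np.array([1, 2, 0, 0, 0, 0, 0, 0, 0, 2]) / 3
    b2 = np.array([0, 0, 3, 4, 0, 0, 0, 0, 0, 0]) / 5
    X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, 10))
    Y = 2 * (X @ b1) + 2 * np.exp(X @ b2) * rng.standard_normal(n)
    return X, Y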

For each model, 200 replicate data sets were simulated for each of three sample sizes: n = 200, 400 and 600. For the three methods concerned, the number of slices was fixed as follows: with continuous responses, 5 slices were used when n = 200, 10 at each of the larger sample sizes; in the discrete response models, II and IV, the number of distinct Y values was used, being 4 and 3 respectively. The Gaussian kernel and its corresponding optimal bandwidth were used for minimum average variance estimation and sliced regression. Our core algorithm used α = 0.05 and J = 1000 throughout. Estimation error was measured by the Frobenius norm of the difference between the matrices representing Euclidean orthogonal projection onto the central subspace and its estimate. Table 1 summarizes these estimation accuracy measures, while Table 2 compares the computing time of the sliced regression and Hellinger methods. All computations were done in Matlab version 7.12 on an office PC.
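This error measure is straightforward to compute from basis matrices of the true and estimated subspaces; a small helper (ours, assuming full-column-rank bases) is:

import numpy as np

def projection_error(B_true, B_hat):
    """Frobenius norm of P_true - P_hat, where P = B (B'B)^{-1} B'."""
    proj = lambda B: B @ np.linalg.solve(B.T @ B, B.T)
    return np.linalg.norm(proj(B_true) - proj(B_hat), 'fro')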

Table 1. Mean (standard deviation) of the estimation errors

              SIR          SAVE         PHD          MAVE         SR           H2
Model I
 n = 200      0.63 (0.14)  0.73 (0.17)  1.00 (0.01)  0.99 (0.05)  0.20 (0.08)  0.40 (0.12)
 n = 400      0.49 (0.11)  0.43 (0.12)  1.00 (0.01)  0.98 (0.04)  0.11 (0.04)  0.21 (0.06)
 n = 600      0.42 (0.09)  0.33 (0.09)  1.00 (0.01)  0.98 (0.04)  0.09 (0.02)  0.17 (0.05)
Model II
 n = 200      0.99 (0.02)  0.58 (0.19)  0.97 (0.04)  0.32 (0.16)  0.74 (0.25)  0.39 (0.10)
 n = 400      0.99 (0.02)  0.29 (0.07)  0.97 (0.04)  0.18 (0.04)  0.37 (0.19)  0.29 (0.07)
 n = 600      0.98 (0.02)  0.21 (0.04)  0.97 (0.06)  0.14 (0.03)  0.24 (0.09)  0.23 (0.06)
Model III
 n = 200      0.39 (0.09)  0.81 (0.17)  0.95 (0.06)  0.75 (0.16)  0.38 (0.11)  0.39 (0.11)
 n = 400      0.27 (0.06)  0.49 (0.16)  0.95 (0.06)  0.71 (0.17)  0.23 (0.05)  0.24 (0.06)
 n = 600      0.21 (0.04)  0.43 (0.14)  0.94 (0.07)  0.66 (0.17)  0.19 (0.05)  0.20 (0.05)
Model IV
 n = 200      0.46 (0.10)  0.79 (0.18)  0.95 (0.07)  0.84 (0.14)  0.63 (0.17)  0.48 (0.13)
 n = 400      0.27 (0.06)  0.63 (0.20)  0.95 (0.07)  0.74 (0.17)  0.42 (0.14)  0.33 (0.08)
 n = 600      0.22 (0.04)  0.40 (0.14)  0.95 (0.07)  0.66 (0.17)  0.33 (0.08)  0.28 (0.06)

SIR, sliced inverse regression; SAVE, sliced average variance estimation; PHD, principal Hessian directions; MAVE, minimum average variance estimation; SR, sliced regression; H2, Hellinger integral of order two.

Table 2. Computation cost (CPU time in seconds) for 200 data replicates

              Model I        Model II       Model III      Model IV
              H2     SR      H2     SR      H2     SR      H2     SR
 n = 200      11     457     10     416     10     397     10     445
 n = 400      28     1224    27     1147    26     1185    31     1325
 n = 600      44     1913    54     1940    43     2045    55     2078

H2, Hellinger integral of order two; SR, sliced regression.

Overall, the estimation accuracy of sliced regression is seen to be good, in absolute terms and relative to the four other existing methods. This confirms results reported in Wang & Xia (2008). However, using local smoothing, its computational cost increases dramatically with n. Relative to sliced regression, the Hellinger core algorithm is seen to be much more efficient computationally, while having broadly comparable accuracy. Indeed, it has the edge in Models II and IV where


Table 3. Permutation test for d̂_M: proportions of estimated dimensions

 Model    d = 0    d = 1    d = 2    d = 3    d = 4
 I        0        0.97     0.03     0        0
 II       0        0.12     0.88     0        0
 III      0        0.21     0.70     0.09     0
 IV       0        0.28     0.67     0.05     0

the response takes only a few discrete values. Again, this is consistent with Figure 1 of Wang & Xia (2008). Whereas sliced regression does best at filtering out the effects of extreme responses in Model I, improving estimation accuracy, the Hellinger method out-performs the other four existing methods due to its use of joint localization.

Sliced regression estimation of the central subspace is √n-consistent under conditions detailed in Wang & Xia (2008), including d_{Y|X} ≤ 3 and optimal bandwidth selection. The other existing methods are well known to miss certain types of dimension in S_{Y|X}, preventing its exhaustive and, hence, consistent estimation. Here, sliced inverse regression misses the first, symmetric, term in Model II, sliced average variance estimation the linear regression mean term in Models III and IV, while principal Hessian directions and minimum average variance estimation only estimate directions in the central mean subspace, and so miss the regression variance term in Models III and IV. Whereas finite sample performance is of paramount importance in practice, establishing conditions for consistent estimation using our local Hellinger method is a key part of the future theoretical work noted at the start of Section 4·1. Numerical indications are good. Exhaustiveness and correct determination of structural dimension do not seem to be an issue in the models simulated here. Recalling that d_{Y|X} = 1 in Model I, while d_{Y|X} = 2 in the others, Table 3 reports the performance of our permutation test procedure estimating d_M for sample size n = 400. As the proportions at the true dimensions show, d̂_M typically agrees with d_{Y|X} across all four models. Empirical evidence for √n-consistency of our core algorithm can also be adduced in the same way that, for sliced regression, it is confirmed in Wang & Xia (2008). For each model, further simulations were run for intermediate sample sizes, providing average estimation errors for five values of n: 200, 300, 400, 500 and 600. Figure 1 shows the results for Model I. If our method were indeed √n-consistent, we would expect an approximately linear relationship between the average estimation error and 1/√n. Figure 1 clearly confirms such an expectation for Model I, similar results being found for the other three models.

Fig. 1. √n-consistency plot: average estimation error for Model I plotted against 1/√n.


The good performance of our core algorithm reported here is confirmed in extensive wider numerical studies. Its computational efficiency means that it remains an operational methodology for data sets whose size puts them beyond the scope of sliced regression. The analysis of real data presented next illustrates this.

5·2. Analysis of community and crime data

There have been extensive studies on the relationship between violent crimes and the socio-economic environment. The data set analyzed here contains information from three sources: social-economic data from the 1990 US census, law enforcement data from the 1990 US Law Enforcement Management and Administrative Statistics Survey, and crime data from the 1995 Federal Bureau of Investigation Uniform Crime Report. Further details about the data and the attributes used can be found at the University of California Irvine Machine Learning Repository: visit http://archive.ics.uci.edu/ml/datasets.html. There are n = 1994 observations from different communities across the US. The response variable is the per capita number of violent crimes. The predictors included in our analysis are shown in the second column of Table 4.

Table 4. Community and crime data: estimated directions

 Predictor                                                     Direction 1  Direction 2  Direction 3
 x(1)  percentage of population that is 65 and over in age        -0.05        -0.04         0.11
 x(2)  median family income                                        0.08        -0.18         0.69
 x(3)  percentage of people under the poverty level                0.91         0.20         0.15
 x(4)  unemployment rate                                           0.02         0.01         0.19
 x(5)  percentage of population who are divorced                   0.04         0.01        -0.04
 x(6)  percentage of kids born to never married parents            0.17        -0.93        -0.17
 x(7)  percentage of people who speak only English                -0.04        -0.03         0.07
 x(8)  mean persons per household                                 -0.05         0.02         0.08
 x(9)  percentage of people in owner occupied households           0.22        -0.09        -0.34
 x(10) percentage of housing occupied                              0.05         0.07        -0.05
 x(11) median value of owner occupied house                        0.16         0.17        -0.53
 x(12) population density in persons per square mile               0.15        -0.02         0.02
 x(13) percent of people using public transit for commuting       -0.13         0.11        -0.05

Fig. 2. Community and crime data: scatter plots of the response Y against the first, second and third estimated directions.

All the variables were normalized into the range 0.00–1.00 using an unsupervised, equal-interval binning method in the original data set. The distributions of most predictors are very skew, which precludes the use of inverse regression methods. The large sample size also prevents the use of sliced regression.


In practice, real data sets such as this can have a low signal-to-noise ratio. In such contexts, we have found it very helpful to filter out what might be called noisy neighbourhoods, in both the estimation of directions and permutation test parts of our core algorithm. This is achieved straightforwardly by omitting observations i for which the largest of the eigenvalues of G_i^{−1} F_i falls below a specified proportion of their total. Here, we use 50% as the threshold value. The permutation test procedure with J = 1000 gives p-values of 0, 0, 0.014 and 0.328, respectively, for the null hypotheses d_M = 0, 1, 2 and 3.
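In code, this filter amounts to a one-line screen applied to each neighbourhood before pooling the local directions (again a sketch of ours, with keep_neighbourhood a hypothetical helper):

import numpy as np

def keep_neighbourhood(F, G, threshold=0.5):
    """Retain observation i only if the largest eigenvalue of G^{-1} F
    exceeds `threshold` times the sum of its eigenvalues."""
    vals = np.real(np.linalg.eigvals(np.linalg.solve(G, F)))
    return vals.max() > threshold * vals.sum()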

With d̂_M = 3, we find the direction estimates reported in the last three columns of Table 4. The first direction is dominated by x(3), the percentage of people under the poverty level, and the second by x(6), the percentage of kids born to never married parents, while the third direction can be seen as a combination of variables related to family structure. The scatter plots of the response against each of these three directions, Figure 2, confirm their importance. Both the poverty level and the percentage of kids with unmarried parents have a significant positive effect on the crime rate. The contrast between median owner-occupied house value and median family income is another important factor.

6. DISCUSSION

We indicate here some variations and extensions of the new approach to dimension reduction and its fast, local, core algorithm introduced above, additional to those in the body of the paper.

The benefits, at greater computational expense, of more refined choices can be explored in several further directions, including that of the thresholding of noisy neighbourhoods described at the end of Section 5·2. Again, in the discrete response case, discarding all cases with k − k′_i below some data-determined threshold may increase stability.

In practice, with continuous responses, exhaustiveness can be assessed by examining residuals from a smooth fit of the regression surface over the estimated central subspace. This may be done graphically and then, if required, by further dimension reduction, treating these residuals as responses. In the discrete case, there are several types of residual that could be tried.

With discrete responses, the estimated central subspace is naturally invariant to one-to-one transformation of the responses, mirroring the same property in the population. This property can be achieved in the continuous case by binning responses appropriately, with some corresponding loss of information.

In principle, the methodology may be extended to multivariate responses, including reduced rank regression models. Other possible enhancements include variable selection.

Finally, the local estimate produced by the core algorithm may prove to be a useful starting point for the naturally exhaustive global approach outlined at the end of Section 2·3.

ACKNOWLEDGMENTS

We are grateful to the Editor, an Associate Editor and two anonymous reviewers for their very insightful and thorough reviews which led to substantial improvements in the manuscript, especially in its presentation, and to M. C. Jones for helpful discussions. Yin's research was supported in part by grants from the National Science Foundation, U.S.A.


APPENDIX

Technical details

Proof of Theorem 1. As is natural, 𝒳²(S | {0_p}) = 𝒳²(S) and 𝒳²({0_p} | S) = 0. Using (1), the result follows at once if either S_1 or S_2 is the trivial subspace.

Suppose then that S_1 = Span(u_1) and S_2 = Span(u_2) are nontrivial subspaces of ℝ^p meeting only at the origin, so that (u_1, u_2) has full column rank and spans their direct sum. Then

R(x_{u_1}, x_{u_2}; y) = R(x_{u_1}; y) R(x_{u_2}; y | x_{u_1})

so that, using E_{(X_{u_2},Y) | X_{u_1}}[{R(X_{u_2}; Y | X_{u_1})}^{−1}] = 1,

H(u_1, u_2) − H(u_1) = E_{(X,Y)}[R(X_{u_1}; Y){H(u_2 | u_1) − 1}] = χ²(u_2 | u_1) ≥ 0,

equality holding if and only if Y ⊥⊥ X_{u_2} | X_{u_1}. □

Proof of Theorem 2. If S = ℝ^p, part 1 is trivial. Otherwise, it is immediate from Theorem 1, taking S_2 as the orthogonal complement in ℝ^p of S_1 = S. Since S_{Y|X} is the intersection of all dimension reduction subspaces, parts 2 and 3 follow at once. Part 4 is immediate from part 1 and Theorem 1. □

Proof of Theorem 3. Writing L_{Y|x} = {y : (x, y) ∈ L} and recalling the notation S_{Y|X} = Span(η), so that p(y | x) = p(y | η^T x), for each (x, y) ∈ L we have:

{∂p_L(y | x)/∂x} / p_L(y | x)
    = {∂p(y | x)/∂x} / p(y | x) − {∂ pr(Y ∈ L_{Y|x} | x)/∂x} / pr(Y ∈ L_{Y|x} | x)    (7a)
    = η [ {∂p(y | η^T x)/∂(η^T x)} / p(y | η^T x) − {∂ pr(Y ∈ L_{Y|x} | η^T x)/∂(η^T x)} / pr(Y ∈ L_{Y|x} | η^T x) ],    (7b)

(7b) implying that ∂p_L(y | x)/∂x ∈ S_{Y|X}. The inclusion now follows at once from Lemma 1. □

Proof of Corollary 1. It suffices to prove the reverse inclusion to (2). By hypothesis, each (x, y) ∈ Ω belongs to some L = L_X × Ω_Y ∈ ℒ so that the second term on the right-hand side of (7a) vanishes, pr(Y ∈ L_{Y|x} | x) = 1 being constant here. The result follows by Lemma 1. □

Proof of Proposition 1. The given transformation x_i^obs → x_{i*}^obs induces V → V_* = A^T V A and x_i → Q x_i, where Q = V_*^{−1/2} A^T V^{1/2} is orthogonal while, the index sets N_i being unchanged, F_i → Q F_i Q^T and G_i → Q G_i Q^T. Thus, each û_i → Q û_i so that, eigenvalues being unchanged, γ → Q γ for each eigenvector γ of M̂, resulting in Ŝ_{Y|X} → Ŝ_{Y|X_*} = Q Ŝ_{Y|X}. Overall,

Ŝ_{Y|X^obs} → Ŝ_{Y|X_*^obs} = V_*^{−1/2} Ŝ_{Y|X_*} = V_*^{−1/2} Q V^{1/2} Ŝ_{Y|X^obs} = A^{−1} Ŝ_{Y|X^obs}. □

REFERENCES

COOK, R. D. (1994). On the interpretation of regression plots. J. Am. Statist. Assoc. 89, 177–190.
COOK, R. D. (1996). Graphics for regressions with a binary response. J. Am. Statist. Assoc. 91, 983–992.
COOK, R. D. (1998a). Regression Graphics: Ideas for Studying Regressions Through Graphics. New York: Wiley.
COOK, R. D. (1998b). Principal Hessian directions revisited (with Discussion). J. Am. Statist. Assoc. 93, 84–100.
COOK, R. D. & WEISBERG, S. (1991). Discussion of Li (1991). J. Am. Statist. Assoc. 86, 328–332.
COOK, R. D. & YIN, X. (2001). Dimension reduction and visualization in discriminant analysis. Aust. N.Z. J. Stat. 43, 147–199.


HÄRDLE, W. & STOKER, T. M. (1989). Investigating smooth multiple regression by the method of average derivatives. J. Am. Statist. Assoc. 84, 986–995.
HERNÁNDEZ, A. & VELILLA, S. (2005). Dimension reduction in nonparametric kernel discriminant analysis. J. Comp. Graph. Statist. 14, 847–866.
LI, B., ZHA, H. & CHIAROMONTE, F. (2005). Contour regression: a general approach to dimension reduction. Ann. Statist. 33, 1580–1616.
LI, K. C. (1991). Sliced inverse regression for dimension reduction (with Discussion). J. Am. Statist. Assoc. 86, 316–342.
LI, K. C. (1992). On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma. J. Am. Statist. Assoc. 87, 1025–1039.
MA, Y. & ZHU, L. (2013). A review on dimension reduction. Int. Stat. Rev. 81, 134–150.
SAMAROV, A. M. (1993). Exploring regression structure using nonparametric functional estimation. J. Am. Statist. Assoc. 88, 836–847.
VAJDA, I. (1989). Theory of Statistical Inference and Information. Netherlands: Kluwer Academic Publishers.
WAND, M. P. & JONES, M. C. (1995). Kernel Smoothing. London: Chapman & Hall/CRC.
WANG, H. & XIA, Y. (2008). Sliced regression for dimension reduction. J. Am. Statist. Assoc. 103, 811–821.
XIA, Y. (2007). A constructive approach to the estimation of dimension reduction directions. Ann. Statist. 35, 2654–2690.
XIA, Y., TONG, H., LI, W. & ZHU, L. (2002). An adaptive estimation of dimension reduction space. J. R. Statist. Soc. B 64, 363–410.
YIN, X. (2010). Sufficient dimension reduction in regression. In The Analysis of High Dimensional Data, Ed. X. Shen and T. Cai, pp. 257–273. New Jersey: World Scientific.
YIN, X. & COOK, R. D. (2002). Dimension reduction for the conditional kth moment in regression. J. R. Statist. Soc. B 64, 159–175.
YIN, X. & COOK, R. D. (2005). Direction estimation in single-index regressions. Biometrika 92, 371–384.
YIN, X., LI, B. & COOK, R. D. (2008). Successive direction extraction for estimating the central subspace in a multiple-index regression. J. Mult. Anal. 99, 1733–1757.
ZENG, P. & ZHU, Y. (2010). An integral transform method for estimating the central mean and central subspaces. J. Mult. Anal. 101, 271–290.
ZHU, Y. & ZENG, P. (2006). Fourier methods for estimating the central subspace and the central mean subspace in regression. J. Am. Statist. Assoc. 101, 1638–1651.

[Received xxxxxx. Revised xxxxxx]