
Biometrika (1981), 68, 1, pp. 45-54. Printed in Great Britain

An optimal selection of regression variables

By RITEI SHIBATA

Department of Mathematics, Tokyo Institute of Technology

SUMMARY

An asymptotically optimal selection of regression variables is proposed. The key assumption is that the number of control variables is infinite or increases with the sample size. It is also shown that Mallows's Cp, Akaike's FPE and AIC methods are all asymptotically equivalent to this method.

Some key words: Selection of variables; Regression analysis; Multiple regression.

1. INTRODUCTION

Selection of regression variables has received considerable attention. In many papers (Newton & Spurrell, 1967; Allen, 1971; Mallows, 1973; Hocking, 1976; Park, 1977; Oliker, 1978) one of the basic assumptions is that the regression function

f(x) = E(Y | X = x)

of the observation Y on the control variable X is specified by only a finite number of parameters. This assumption is reasonable for obtaining an estimate or a testing procedure but, as a description of the true structure of the population, it is not so easily justified. Even if f(x) is continuous on some finite interval, we cannot avoid dealing with an infinite series expansion: a polynomial expansion, an orthogonal expansion and so on. Therefore it is rather natural to specify f(x) using infinitely many parameters (Sims, 1971).

Consider the Hilbert space l₂ of sequences of real numbers with the inner product ⟨·,·⟩ and the norm ‖·‖. Suppose that f(x) can be expressed as f(x) = ⟨x, β⟩, where x = (x₁, x₂, ...) is the vector of control variables and β' = (β₁, β₂, ...) is the vector of parameters, both in l₂. The observational equation is then Y = ⟨x, β⟩ + ε, where ε is the error variable, normally distributed with mean 0 and unknown variance σ² > 0. Given n independent observations on Y at x⁽¹⁾, ..., x⁽ⁿ⁾, we can estimate effectively at most n parameters. By j = (j₁, ..., j_{k(j)}) we denote the model with the regression function f(x) whose parameters are of the form

β(j)' = (0, ..., 0, β_{j₁}, 0, ..., 0, β_{j₂}, 0, ..., 0, β_{j_{k(j)}}, 0, ...),

where j₁ < j₂ < ... < j_{k(j)}, k(j) ≥ 1, and by V(j) the subspace of such vectors. The model j selects not only a finite number of parameters but also the control variables x_{j₁}, ..., x_{j_{k(j)}}. Therefore we will not distinguish selection of variables from that of parameters.

Applying the model j, we have the least squares estimates of the specified elements of β(j), β̂(j)' = {β̂_{j₁}(j), ..., β̂_{j_{k(j)}}(j)}, which are the solution of M_n(j)β̂(j) = X(j)'y. Here y = (y₁, ..., y_n)' is the vector of observations,

X(j) = {x_{αj_l}; 1 ≤ α ≤ n, 1 ≤ l ≤ k(j)}

is the n × k(j) design matrix generated by the vectors x^{(α)'} = (x_{α1}, x_{α2}, ...) (α = 1, ..., n),


and M_n(j) = X(j)'X(j) is a k(j) × k(j) matrix. Hereafter β̂(j) is occasionally considered as the corresponding infinite dimensional vector in l₂ with undefined entries zero.

Using the Euclidean norm ‖·‖, we can write the residual sum of squares as nσ̂²(j) = ‖y − X(j)β̂(j)‖². The least squares predictor of a future observation on Y at x^{(α)} is given by

ŷ_α = ⟨x^{(α)}, β̂(j)⟩   (α = 1, ..., n).

The expectation, with respect to future observations, of the sum of squared errors of prediction is then

‖β̂(j) − β‖²_{M_n} + nσ²,   (1.1)

where

M_n = {Σ_{α=1}^n x_{αl} x_{αm}; 1 ≤ l, m < ∞}

is an infinite dimensional matrix, also considered as a nonnegative bounded linear operator on l₂, and ‖α‖_{M_n} = ⟨M_n α, α⟩^{1/2} is a seminorm defined for any vector α in l₂.

Since one of the main objects of regression analysis is to make a good prediction of future observations, we may formulate the problem in the following manner. Given a family of models J_n, which is a finite collection of j's, possibly depending on the sample size n, we select a model ĵ in J_n, based only on the observations y₁, ..., y_n. Applying the selected model ĵ, we obtain the least squares estimates of the parameters, based on the same observations.

Optimality of the selection is thus evaluated by the value of (1.1) when j is replaced by ĵ. We remark here that this formulation applies not only to prediction but also to estimation of the mean vector, since the first term of (1.1) is also the squared error of the estimated mean vector.

In § 2 we show that the selection ĵ from J_n which minimizes the statistic

S_n(j) = {n + 2k(j)} σ̂²(j)

attains a lower bound for (1.1) in the limit as n tends to infinity. The proof of the result is similar to that for autoregressive processes (Shibata, 1980). However, the assumptions here are that the observations are independent and that the design matrix is known a priori, so that more detailed results can easily be obtained. In § 3 the special case where the ordering of the variables for inclusion is given a priori is considered. Then in § 4 the more general case of variable selection, where there is no such explicit ordering, is considered.
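As an illustration of how this criterion can be applied in practice, the following sketch (not part of the original paper; the Python code, the simulated data and all names are ours and purely illustrative) computes S_n(j) = {n + 2k(j)}σ̂²(j) for each candidate model in a family and returns the minimizer ĵ.

import numpy as np

def S_n(y, X, cols):
    # Shibata's criterion {n + 2k(j)} * sigma_hat^2(j) for the model using columns `cols`
    n = len(y)
    Xj = X[:, cols]
    beta_hat, *_ = np.linalg.lstsq(Xj, y, rcond=None)
    rss = np.sum((y - Xj @ beta_hat) ** 2)          # n * sigma_hat^2(j)
    return (n + 2 * len(cols)) * rss / n

def select(y, X, family):
    # return the member of `family` (a list of column-index tuples) minimizing S_n
    return min(family, key=lambda cols: S_n(y, X, list(cols)))

rng = np.random.default_rng(0)
n, K = 200, 15
x = np.linspace(0.0, 0.8, n)
X = np.vander(x, K, increasing=True)                # columns 1, x, ..., x^{K-1}
y = np.exp(x) + rng.normal(scale=0.5, size=n)       # true f(x) is not a finite polynomial
nested = [tuple(range(k)) for k in range(1, K + 1)]
print(select(y, X, nested))                         # columns of the selected model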

2. AN OPTIMAL SELECTION OF VARIABLES

Let β^{(n)}(j) be the projection of β on the subspace V(j) with respect to the seminorm ‖·‖_{M_n}, that is, the vector which minimizes ‖β − β(j)‖_{M_n} over V(j). The first term of the prediction error (1.1) then equals

‖β̂(j) − β^{(n)}(j)‖²_{M_n} + ‖β − β^{(n)}(j)‖²_{M_n}.   (2.1)

As is well known, if k(j) ≤ n and rank{M_n(j)} = k(j), then ‖β̂(j) − β^{(n)}(j)‖²_{M_n}/σ² is distributed according to χ² with k(j) degrees of freedom. Put

L_n(j) = E‖β̂(j) − β‖²_{M_n},


which then equals

‖β − β^{(n)}(j)‖²_{M_n} + k(j)σ².

Let j*(n) be an element of J_n which minimizes L_n(j).

LEMMA 2.1. Let χ²_k be a χ² random variable with k degrees of freedom. Then for any δ > 0,

pr(χ²_k ≤ k − δ) ≤ {e^{δ/k}(1 − δ/k)}^{k/2} ≤ exp{−δ²/(4k)}

if k > δ, the probability being zero otherwise, and

pr(χ²_k ≥ k + δ) ≤ {e^{−δ/k}(1 + δ/k)}^{k/2} ≤ exp{−δ²/4(k + δ)}.

Proof. Consider the moment generating function φ(θ) = (1 − 2θ)^{−k/2}. The lemma is obtained by putting 2θ = −δ/(k − δ) and 2θ = δ/(k + δ), respectively.
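The bounds of Lemma 2.1 can be checked numerically; the following sketch (ours, not from the paper, assuming scipy is available) compares the exact lower-tail probability of χ²_k with the two bounds of the first inequality.

import numpy as np
from scipy.stats import chi2

def lower_tail_bounds(k, delta):
    # exact pr(chi^2_k <= k - delta) and the two bounds of Lemma 2.1 (valid for k > delta)
    exact = chi2.cdf(k - delta, df=k)
    chernoff = (np.exp(delta / k) * (1.0 - delta / k)) ** (k / 2.0)
    crude = np.exp(-delta ** 2 / (4.0 * k))
    return exact, chernoff, crude

for k, delta in [(10, 5), (50, 20), (200, 60)]:
    print(k, delta, lower_tail_bounds(k, delta))
# each printed triple is increasing: exact <= {e^(d/k)(1 - d/k)}^(k/2) <= exp(-d^2/(4k))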

THEOREM 2.1. Assume that, for any j in J_n, k(j) ≤ n and rank{M_n(j)} = k(j). If, for any δ > 0,

lim_{n→∞} Σ_{j∈J_n: δα_n(j)<1} [{1 − δα_n(j)} exp{δα_n(j)}]^{k(j)/2} = 0,

where α_n(j) = L_n(j)/{k(j)σ²}, then for any selection ĵ in J_n, possibly depending on the observations y₁, ..., y_n,

lim_{n→∞} pr{‖β̂(ĵ) − β‖²_{M_n}/L_n(j*(n)) > 1 − δ} = 1.

Proof. Since

pr{min_{j∈J_n} ‖β̂(j) − β‖²_{M_n}/L_n(j*(n)) ≤ 1 − δ} ≤ pr[min_{j∈J_n} {‖β̂(j) − β‖²_{M_n}/L_n(j)} ≤ 1 − δ]

≤ Σ_{j∈J_n} pr{‖β̂(j) − β^{(n)}(j)‖²_{M_n}/σ² ≤ k(j) − δL_n(j)/σ²},

we have the result by Lemma 2.1 and the definition of j*(n).

To show that the selection ĵ attains the lower bound L_n(j*(n)) suggested by Theorem 2.1, we need the following assumptions, which are essentially stronger than those of Theorem 2.1, as can easily be checked from the right-hand side of the first inequality in Lemma 2.1 together with the fact that α_n(j) ≥ 1.

Assumption 1. For any j in J_n, rank{M_n(j)} = k(j), and max_{j∈J_n} k(j) = o(n).

Assumption 2. For any 0 < δ < 1, Σ_{j∈J_n} δ^{L_n(j)} converges to zero as n tends to infinity.

Since the expectation of S_n(j) is L_n(j) + nσ² + {2k(j)/n}{L_n(j) − 2k(j)σ²}, heuristically ĵ may be expected to behave like j*(n), which minimizes L_n(j). This is the basic idea of Akaike's FPE or AIC method (Akaike, 1970, 1973, 1974). We now present a rigorous proof that for ĵ the prediction error (1.1) is asymptotically minimized.
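The displayed expectation can be verified by simulation. The sketch below (ours; the truncated polynomial design, the coefficient values and the model size are arbitrary choices made only for illustration) estimates E{S_n(j)} by Monte Carlo and compares it with L_n(j) + nσ² + {2k(j)/n}{L_n(j) − 2k(j)σ²}.

import numpy as np

rng = np.random.default_rng(1)
n, K, k, sigma = 100, 12, 3, 1.0
x = np.linspace(0.0, 0.8, n)
X = np.vander(x, K, increasing=True)            # 'full' design, K columns
beta = 0.5 ** np.arange(K)                      # illustrative true coefficients
mu = X @ beta                                   # true mean vector

Xj = X[:, :k]                                   # model j: first k columns
Pj = Xj @ np.linalg.solve(Xj.T @ Xj, Xj.T)      # projection onto the span of X(j)
bias = np.sum((mu - Pj @ mu) ** 2)              # ||beta - beta^(n)(j)||^2_{M_n}
L = bias + k * sigma ** 2                       # L_n(j)

reps, total = 5000, 0.0
for _ in range(reps):
    y = mu + rng.normal(scale=sigma, size=n)
    rss = np.sum((y - Pj @ y) ** 2)             # n * sigma_hat^2(j)
    total += (n + 2 * k) * rss / n              # S_n(j)
print(total / reps)                                                  # Monte Carlo E{S_n(j)}
print(L + n * sigma ** 2 + (2 * k / n) * (L - 2 * k * sigma ** 2))   # the displayed formula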

THEOREM 2.2. If Assumptions 1 and 2 are satisfied, then

plim_{n→∞} ‖β̂(ĵ) − β‖²_{M_n}/L_n(j*(n)) = 1.


Proof. Putting s²(j) = ‖y − X(j)β^{(n)}(j)‖²/n, we can rewrite S_n(j) as

S_n(j) = L_n(j) + {k(j)σ² − ‖β̂(j) − β^{(n)}(j)‖²_{M_n}} + 2k(j){σ̂²(j) − σ²} + n{s²(j) − σ²(j)} + nσ²,   (2.2)

where σ²(j) = E{s²(j)}. Now, similarly as in Theorem 2.1, from Lemma 2.1 and the assumptions, we obtain

plim_{n→∞} {‖β̂(j) − β^{(n)}(j)‖²_{M_n} − k(j)σ²}/L_n(j) = 0   (2.3)

uniformly in J_n. Thus the second term on the right-hand side of (2.2) is negligible relative to L_n(j). For the third term, we have that

|σ̂²(j) − σ²| ≤ |σ̂²(j) − s²(j)| + |s²(j) − σ²|

≤ ‖β̂(j) − β^{(n)}(j)‖²_{M_n}/n + ‖β − β^{(n)}(j)‖²_{M_n}/n + 2|ε'X{β − β^{(n)}(j)}|/n + |‖ε‖²/n − σ²|,

where ε' = (ε₁, ..., ε_n) and X = (x_{αl}; 1 ≤ α ≤ n, 1 ≤ l < ∞). Thus

k(j)|σ̂²(j) − σ²|/L_n(j) ≤ {max_{j∈J_n} k(j)/n}[max_{j∈J_n} {|‖β̂(j) − β^{(n)}(j)‖²_{M_n} − k(j)σ²|/L_n(j)} + 1]

+ (2/σ)[{max_{j∈J_n} k(j)/n}{‖ε‖²/n}]^{1/2} + |‖ε‖²/(nσ²) − 1|.

Recalling (2.3) and applying the law of large numbers, we get the convergence to zero in probability. Therefore the third term on the right-hand side of (2.2) is also negligible relative to L_n(j). The last two terms are not negligible; however, the behaviour of ĵ depends only on differences of S_n(j), and we may evaluate

Δ_n(j) = n{s²(j) − σ²(j)} − n{s²(j*(n)) − σ²(j*(n))} = 2ε'X{β^{(n)}(j*(n)) − β^{(n)}(j)}.

From the well-known evaluation of the tail of the normal distribution (Feller, 1968, p. 175), using the monotone decreasing function φ(x) = (2/π)^{1/2} x^{−1} exp(−x²/2) on x > 0, and noting that ‖β^{(n)}(j) − β^{(n)}(j*(n))‖²_{M_n} ≤ 2L_n(j), we have for any δ > 0

pr{|Δ_n(j)|/L_n(j) ≥ δ} ≤ φ[δL_n(j)/{2σ‖β^{(n)}(j) − β^{(n)}(j*(n))‖_{M_n}}]

≤ φ{δL_n(j)^{1/2}/(2√2 σ)} ≤ {4/(δ√π)} exp{−δ²L_n(j)/(16σ²)}.   (2.4)

Assumption 2 implies that the sum of the right-hand side of (2.4) over J_n converges to zero, so that Δ_n(j) is negligible relative to L_n(j) uniformly in J_n. Accordingly

{S_n(j) − S_n(j*(n))}/L_n(j) − {L_n(j) − L_n(j*(n))}/L_n(j)

converges to zero in probability uniformly in J_n, and the definition of ĵ then implies that, for any δ > 0,

lim_{n→∞} pr{L_n(j*(n))/L_n(ĵ) ≥ 1 − δ} = 1.


Recalling that L_n(j*(n))/L_n(ĵ) ≤ 1, we have

plim_{n→∞} L_n(j*(n))/L_n(ĵ) = 1.   (2.5)

Finally, replace j by ĵ in (2.3) and add 1 to both sides of (2.3). It is then enough to use (2.5).

The results in this section hold even if the parameter β depends on the sample size n.

3. SELECTION OF THE NUMBER OF VARIABLES

If an ordering of the variables is given a priori, we may consider a family of models of the following type:

J_n = {(1), (1, 2), ..., (1, ..., K_n)}.

The problem is then reduced to the choice of the number of variables, and a model is specified by the number of variables k = k(j) instead of by j. Assumption 2 can be replaced by one of the following conditions.

CONDITION 1. There exists a divergent sequence {k_n} such that k_n ≤ K_n and log k_n = o{‖β − β^{(n)}(k_n)‖²_{M_n}}.

Assumption 2 follows from this condition, since, for any 0 < δ < 1,

Σ_{k=1}^{K_n} exp{L_n(k) log δ} = Σ_{k=1}^{k_n} exp{L_n(k) log δ} + Σ_{k=k_n+1}^{K_n} exp{L_n(k) log δ}

≤ k_n exp{‖β − β^{(n)}(k_n)‖²_{M_n} log δ} + exp(k_nσ² log δ)/{1 − exp(σ² log δ)}.   (3.1)

CONDITION 2. The sequence {K_n} diverges to infinity with n, and ‖β − β^{(n)}(k)‖²_{M_n} diverges to infinity for any fixed k > 0.

It is easy to show the equivalence of Conditions 1 and 2. Even if β has only a finite number of nonzero coordinates, the above conditions can be satisfied when this number increases with n; see Example 3.3.

THEOREM 3.1. Assume that there exists a positive divergent sequence {c_n} such that, as a linear operator, c_n^{−1}M_n weakly converges to a nonsingular, i.e. one-to-one, infinite dimensional matrix M, whose k × k principal submatrix M(k) has full rank k for any k > 0. If β does not depend on n and has infinitely many nonzero coordinates, then for any fixed k > 0,

‖β − β^{(n)}(k)‖²_{M_n}/c_n   (3.2)

is bounded, and bounded away from zero for large enough n, and k*(n) = k(j*(n)) diverges to infinity as n → ∞. Condition 2 is satisfied when K_n diverges to infinity as n → ∞.

Proof. For any fixed k, the minimum eigenvalue λ_n(k) of c_n^{−1}M_n(k) converges to that of the principal submatrix M(k) of M, so that λ_n(k) is bounded away from zero for n sufficiently large. Since β^{(n)}(k) is the projection of β, the relation ‖β^{(n)}(k)‖²_{M_n} = ⟨M_nβ, β^{(n)}(k)⟩ yields

‖β^{(n)}(k)‖ ≤ ‖c_n^{−1}M_n‖ ‖β‖/λ_n(k).   (3.3)


Here, by the Banach-Steinhaus theorem (Yosida, 1968, p. 73), the operator norms ‖c_n^{−1}M_n‖ for n = 1, 2, ... are bounded, which implies the boundedness of the right-hand side of (3.3). Now, (3.2) is rewritten as

⟨(c_n^{−1}M_n − M){β − β^{(n)}(k)}, β − β^{(n)}(k)⟩ + ‖β − β^{(n)}(k)‖²_M,   (3.4)

where ‖α‖_M = ⟨Mα, α⟩^{1/2} is a norm on l₂, for M is a one-to-one positive hermitian operator. The first term of (3.4) converges to zero, since only the first k coordinates of β − β^{(n)}(k) can depend on n and these are bounded. Let β^{(0)}(k) be the projection of β on V(k) with respect to ‖·‖_M; then, for the second term of (3.4),

‖β − β^{(n)}(k)‖²_M ≥ ‖β − β^{(0)}(k)‖²_M,

whose right-hand side is nonzero for any k and independent of n. Consequently (3.4) is bounded away from zero for sufficiently large n. The boundedness of (3.4) is clear. Furthermore, letting β̄(k) be the projection of β on V(k) with respect to the norm ‖·‖, we have

‖β − β^{(n)}(k)‖²_{M_n}/c_n ≤ ‖c_n^{−1}M_n‖ ‖β − β̄(k)‖².   (3.5)

This proves the divergence of k*(n). In fact, if k* = lim inf k*(n), as n → ∞, is finite, we can choose a divergent sequence {k_n} such that k* < k_n ≤ K_n and k_n = o(c_n). From the definition of k*(n), for infinitely many n,

L_n(k*)/c_n = L_n(k*(n))/c_n ≤ L_n(k_n)/c_n.   (3.6)

From (3.5), the right-hand side of (3.6) converges to zero. Thus there exists a subsequence {n'} such that

lim_{n'→∞} ‖β − β^{(n')}(k*)‖²_{M_{n'}}/c_{n'} = 0.

An application of the first part of the proof gives lim_{n'→∞} ‖β − β^{(n')}(k*)‖²_M = 0. This contradicts the fact that

‖β − β^{(n)}(k*)‖_M ≥ ‖β − β^{(0)}(k*)‖_M > 0

for any n, and the proof is complete.

We remark here that even if β depends on n, Theorem 3.1 holds when β converges, with respect to the norm ‖·‖, to a vector which has infinitely many nonzero coordinates.

In order to show how to verify Assumptions 1 and 2 by applying Theorem 3.1, we give some examples.

Example 3.1. Polynomial regression. If the regression function is of the form

f(x) = Σ_{l=0}^∞ x^l β_{l+1}   (β ∈ l₂, 0 ≤ x ≤ δ),

and if y₁, ..., y_n are observed at x = 0, δ/n, 2δ/n, ..., {(n − 1)/n}δ for 0 < δ < 1, then the observational equations are

y_α = Σ_{l=0}^∞ {(α − 1)δ/n}^l β_{l+1} + ε_α   (α = 1, ..., n).


In this case,

M_n = [Σ_{α=1}^n {(α − 1)δ/n}^{l+m−2}; 1 ≤ l, m < ∞]

and n^{−1}M_n uniformly converges to M = {δ^{l+m−2}/(l + m − 1); 1 ≤ l, m < ∞}, because

|(1/n) Σ_{α=1}^n {(α − 1)δ/n}^{l+m−2} − δ^{l+m−2}/(l + m − 1)| ≤ δ^{l+m−2}/n ≤ 1/n.

From a result about Stieltjes transforms (Widder, 1946, p. 336, Theorem 5a), if

Σ_{l=1}^∞ x_l δ^{l+m−2}/(l + m − 1) = 0   (m = 1, 2, ...),

then x_l = 0 for all l = 1, 2, .... Accordingly M is nonsingular, with linearly independent column or row vectors. Also M(k) is of full rank. Applying Theorem 3.1, we see that Assumptions 1 and 2 are satisfied if β has infinitely many nonzero coordinates, for example if f(x) is logarithmic or exponential, and K_n diverges to infinity more slowly in order than n.
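The convergence of n^{−1}M_n in this example can be seen numerically. The sketch below (ours, not from the paper) builds the leading columns of the polynomial design and compares n^{−1}X'X with the limit matrix M; the truncation level p and the value of δ are arbitrary.

import numpy as np

n, p, delta = 5000, 5, 0.8
x = np.arange(n) * delta / n                     # x = 0, delta/n, ..., (n-1)delta/n
X = np.vander(x, p, increasing=True)             # leading columns x^0, ..., x^{p-1}
Mn_over_n = X.T @ X / n
M = np.array([[delta ** (l + m - 2) / (l + m - 1) for m in range(1, p + 1)]
              for l in range(1, p + 1)])         # limit entries delta^(l+m-2)/(l+m-1)
print(np.max(np.abs(Mn_over_n - M)))             # of order 1/n, hence small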

Example 3.2. Curve fitting. If the regression function is a real-valued function on [0, 1) which has continuous derivatives of order up to 2, then its Fourier series expansion can be written as

f(x) = Σ_{l=0}^∞ β_{l+1} cos(πlx),   (3.7)

with β in l₂. Given observations at x = 0, 1/n, ..., (n − 1)/n, 2n^{−1}M_n converges to the diagonal matrix M = diag(2, 1, 1, ...). Thus, as in Example 3.1, Assumptions 1 and 2 are satisfied if β has infinitely many nonzero coordinates and K_n diverges to infinity more slowly in order than n.

More generally, if f(x) is a function on (−1, 1), then the series (3.7) would have sine and cosine parts. We can apply the preceding result if some orderings of the variables for inclusion are given a priori in those two parts; see § 4. Other extensions are possible to the case where f(x) is complex valued or has other orthogonal expansions.
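A check similar to that of Example 3.1 can be made here. The sketch below (ours; it assumes the cosine system written in (3.7) and equally spaced observation points) prints the maximal deviation of 2n^{−1}X'X from diag(2, 1, 1, ...).

import numpy as np

n, p = 1000, 6
x = np.arange(n) / n                             # x = 0, 1/n, ..., (n-1)/n
X = np.cos(np.pi * np.outer(x, np.arange(p)))    # columns cos(pi*l*x), l = 0, ..., p-1
M = 2.0 * X.T @ X / n
target = np.diag([2.0] + [1.0] * (p - 1))
print(np.max(np.abs(M - target)))                # off-diagonal terms are of order 1/n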

Example 3.3. Observations with repetition. Given observations taken repeatedly at only p distinct points, the rank of M_n does not exceed p and Assumptions 1 and 2 are not satisfied. A practical asymptotic approach, however, is to regard p = p_n as increasing with the sample size n rather than as fixed, with n_i observations taken at the point x^{(i)} (i = 1, ..., p_n), where n = n₁ + ... + n_{p_n} and the n_i are not necessarily fixed. If an ordering of the variables for inclusion is predetermined and the vectors of control variables x^{(1)}, ..., x^{(p_n)} are linearly independent, it is easy to check Assumptions 1 and 2 by applying Theorem 3.1.

4. SUBSET SELECTION

In this section we consider the general problem of selecting the regression variables themselves, which has been called subset selection (McClave, 1978). Of course, it is not very realistic to consider all combinations of variables, because 2^k − 1 combinations are possible even if there are only a finite number, k, of variables under consideration. Another reason for not considering a large number card(J_n) of models in J_n is that Assumption 2 could not be satisfied if card(J_n) were too large, and Assumption 2 is needed to ensure that Theorem 2.2 holds true. A simple sufficient condition on card(J_n) for Assumption 2 is given by Condition 3.

CONDITION 3. The sequence k(j*(n)) diverges to infinity and log card(J_n) = o{k(j*(n))}.

It would also be unusual to have no information on the relative importance of the variables. It is often the case that from such prior information we can obtain some priorities or orderings of variables for inclusion in the model.

If the suggested ordering is unique, then all variables are totally ordered and the problem reduces to that examined in the preceding section. Otherwise, a finite number of orderings are available and, corresponding to each ordering, we may form a family of models. For example, from an ordering x_{j₁}, ..., x_{j_k}, a family of models J_{n1} = {(j₁), (j₁, j₂), ..., (j₁, j₂, ..., j_k)} follows in a natural manner.

Therefore, if q orderings are suggested, the q families J_{n1}, ..., J_{nq} are obtained, and J_n = J_{n1} ∪ ... ∪ J_{nq} is the totality of the models under consideration. In each J_{nl} we can select a model, say ĵ(l), simply by specifying the number of variables as in § 3; then, comparing the ĵ(l)'s, we obtain the resulting model ĵ. An advantage of our selection is that it is independent of the particular partitioning of J_n used for the selection; this follows from the definition of ĵ. Provided the number q is independent of n, the problem reduces again to that considered in § 3. That is, it is sufficient to verify Assumptions 1 and 2 in each J_{nl} by applying Theorem 3.1.
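The comparison across orderings can be organized as in the following sketch (ours, not from the paper; the two orderings and the simulated data are arbitrary): within each suggested ordering the nested family is scanned, and the winners are then compared by the same statistic S_n.

import numpy as np

def S_n(y, X, cols):
    # {n + 2k(j)} * sigma_hat^2(j) for the model using columns `cols`
    n = len(y)
    Xj = X[:, cols]
    beta_hat, *_ = np.linalg.lstsq(Xj, y, rcond=None)
    return (n + 2 * len(cols)) * np.sum((y - Xj @ beta_hat) ** 2) / n

def select_over_orderings(y, X, orderings):
    # within each ordering scan the nested family, then compare the winners by S_n
    winners = []
    for order in orderings:
        nested = [list(order[:k]) for k in range(1, len(order) + 1)]
        winners.append(min(nested, key=lambda cols: S_n(y, X, cols)))
    return min(winners, key=lambda cols: S_n(y, X, cols))

rng = np.random.default_rng(2)
n = 150
X = rng.normal(size=(n, 6))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)
orderings = [[0, 1, 2, 3, 4, 5], [2, 0, 5, 4, 3, 1]]   # two hypothetical a priori orderings
print(select_over_orderings(y, X, orderings))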

However, it is desirable to introduce more complex models incorporating an increasing number of variables, to obtain a good approximation to the structure of the population, as has been pointed out by Stone (1979). This implies an increase of q as n increases. In this case, our main concern is to find a bound on the rate of increase of q = q_n such that Assumption 2 is satisfied. By substituting

k_n = max_{j∈J_n} [k(j); k(j)σ²/‖β − β^{(n)}(j)‖²_{M_n} ≤ C]

in (3.1), we have Condition 4.

CONDITION 4. For some C > 0, k_n diverges to infinity as n → ∞ and log q_n = o(k_n).

Here k_n is the largest number of variables for which the bias ‖β − β^{(n)}(j)‖²_{M_n} is comparable in order of magnitude with, or larger than, the variance term k(j)σ² in L_n(j). The condition shows that if the bias decreases at a sufficiently fast rate as the number k(j) of variables increases, uniformly in J_n, then one cannot choose q_n to be too large, since k_n is small. This restriction does not weaken the value of our theory, because a rapid decrease of the bias means that it is not necessary to consider very many combinations of variables: a good approximation to the population structure is then obtained by increasing only the number of variables.

As k_n becomes close to k(j*(n)), using k(j*(n)) in (3.1) instead of k_n, we obtain another type of condition.

CONDITION 5. The sequence k(j*(n)) diverges to infinity as n → ∞ and log q_n = o{k(j*(n))}.

In this condition, j*(n) can be replaced by j*(n, l), which minimizes L_n(j) in each J_{nl}. Then, practically, it is sufficient to check the divergence of k(j*(n, l)) in each J_{nl} by applying Theorem 3.1 and to determine whether log q_n is smaller than

max_{1≤l≤q_n} k(j*(n, l))

in order of magnitude. This is demonstrated in the following example.

Example 4.1. Estimation of the mean of a multivariate normal distribution. Suppose we are given n observations on (Y₁, ..., Y_p), distributed independently according to the p-variate normal distribution with mean β' = (β₁, ..., β_p), and that p is relatively large. One common estimation procedure is then to estimate only some coordinates of β, putting the unestimated coordinates equal to zero (Sugiura, 1978). Suppose that J_{n1}, ..., J_{nq_n} are predetermined and that the number of nonzero coordinates of β increases as p increases, or that β converges to a vector β^{(0)} in l₂ which has infinitely many nonzero coordinates. Since M_n = nI_p, Theorem 3.1 holds true in each J_{nl} if card(J_{nl}) increases as n → ∞.

Then k(j*(n, l)) is at least of order log n if β₁^{(0)}, β₂^{(0)}, ... decrease exponentially. Therefore, if q_n can be written as q_n = n^{ε_n}, where ε_n is a sequence tending to zero, for example q_n = O(log n), then Condition 5 is satisfied. If, alternatively, β₁^{(0)}, β₂^{(0)}, ... decrease like an αth power, then k(j*(n, l)) is at least of order n^{1/(α+1)}, so that if q_n can be written as q_n = exp{n^{1/(α+1)} ε_n}, for example q_n = O(n), then Condition 5 is satisfied.
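The rates quoted in this example can be illustrated numerically. In the sketch below (ours, not from the paper; the decay rates 0.7^l and l^{−1.5} are arbitrary choices), M_n = nI so that L_n(k) = n Σ_{l>k} β_l² + kσ², and the minimizing k*(n) is printed for increasing n.

import numpy as np

def k_star(n, beta, sigma2=1.0):
    # minimizer over k of L_n(k) = n * sum_{l > k} beta_l^2 + k * sigma2 (here M_n = n*I)
    tail = np.cumsum(beta[::-1] ** 2)[::-1]      # tail[i] = sum of beta_l^2 for l >= i+1
    tail = np.append(tail[1:], 0.0)              # bias of the model keeping the first k coordinates
    L = n * tail + sigma2 * np.arange(1, len(beta) + 1)
    return int(np.argmin(L)) + 1

p = 5000
beta_exp = 0.7 ** np.arange(1, p + 1)                        # exponentially decreasing coordinates
beta_pow = np.arange(1, p + 1, dtype=float) ** -1.5          # power-law decreasing coordinates
for n in [100, 1000, 10000, 100000]:
    print(n, k_star(n, beta_exp), k_star(n, beta_pow))
# the second column grows roughly like log n, the third like a power of n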

5. COMPARISONS WITH OTHER METHODS

We have demonstrated an optimality property of ĵ, but there are also other, equivalent methods. One way of generating equivalent methods is to change the factor n + 2k(j) in S_n(j) by a small quantity δ_n(j). If δ_n(j) satisfies

plim_{n→∞} max_{j∈J_n} |δ_n(j)|/n = 0,   plim_{n→∞} max_{j∈J_n} {|δ_n(j) − δ_n(j*(n))|/L_n(j)} = 0,

then the selection which minimizes the statistic S_n^{(δ)}(j) = {n + 2k(j) + δ_n(j)} σ̂²(j) is equivalent to ĵ in our sense. This can be shown using the technique given in Theorem 4.2 of Shibata (1980). From this result, the equivalence of the Cp method (Mallows, 1973), the FPE method (Akaike, 1970), the AIC method (Akaike, 1973, 1974) and Sugiura's finite correction method (Sugiura, 1978) follows. However, in small samples the selection is very sensitive to small changes of δ_n(j), so that it is necessary to check finite sample properties. Some results of computer simulations show that the FPE method has good small-sample properties. The FPE method is the selection minimizing

S_n^{(δ)}(j) = {n + k(j)} nσ̂²(j)/{n − k(j)}.
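Since this statistic and S_n(j) differ only by 2k(j)²σ̂²(j)/{n − k(j)}, the FPE selection and the selection by S_n typically coincide for moderate n. The sketch below (ours, not from the paper; the sine regression function and the polynomial design are arbitrary) computes both criteria over a nested family and prints the two chosen model sizes.

import numpy as np

def criteria(y, X, k):
    # return (S_n, FPE) for the model using the first k columns of X
    n = len(y)
    Xk = X[:, :k]
    beta_hat, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    sigma2_hat = np.sum((y - Xk @ beta_hat) ** 2) / n
    return (n + 2 * k) * sigma2_hat, n * (n + k) / (n - k) * sigma2_hat

rng = np.random.default_rng(3)
n, K = 200, 10
x = np.linspace(0.0, 1.0, n)
X = np.vander(x, K, increasing=True)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
values = [criteria(y, X, k) for k in range(1, K + 1)]
print(1 + int(np.argmin([s for s, _ in values])))    # model size chosen by S_n
print(1 + int(np.argmin([f for _, f in values])))    # model size chosen by FPE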

The details will be given elsewhere.

On the other hand, it can be shown that some proposed methods are not asymptotically optimal in our sense. In fact, for δ_n(j) = α_n k(j), the corresponding selection is asymptotically optimal if and only if α_n converges to zero (Shibata, 1980). For example, the selection which minimizes the mean squared error nσ̂²(j)/{n − k(j)} is asymptotically equivalent to the above with α_n = −1, so that it is not asymptotically optimal in our sense. Another example is the selection which minimizes

log σ̂²(j) + 2k(j)(log log n)/n.


This was proposed as a selector of autoregressive variables by Hannan & Quinn (1979) under the assumption that the number of variables is finite. It is asymptotically equivalent to the above with α_n = 2(log log n − 1), so that it is also not asymptotically optimal in our sense. Furthermore, it can be shown that the BIC method proposed by Schwarz (1978) and Akaike (1978) is also not asymptotically optimal in our sense. However, the optimality of our selection is obtained only from consideration of the prediction error as given by (1.1), and under the assumption that the regression function is specified by infinitely many nonzero parameters or by an increasing number of variables as n → ∞.

The author wishes to thank the referees, Dr S. Mase and Dr M. Sibuya, for their helpful advice. The revision was done while the author was visiting the Australian National University, and he is indebted to Dr S. Wilson and Professor E. J. Hannan for their helpful suggestions.

REFERENCES

AKAIKE, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math. 22, 203-17.

AKAIKE, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, Eds B. N. Petrov and F. Csaki, pp. 267-81. Budapest: Akademiai Kiado.

AKAIKE, H. (1974). A new look at the statistical model identification. I.E.E.E. Trans. Auto. Control 19, 716-23.

AKAIKE, H. (1978). A Bayesian analysis of the minimum AIC procedure. Ann. Inst. Statist. Math. A 30, 9-14.

ALLEN, D. M. (1971). Mean square error of prediction as a criterion for selecting variables. Technometrics 13, 469-81.

FELLER, W. (1968). An Introduction to Probability Theory and its Applications, 1, 3rd edition. New York: Wiley.

HANNAN, E. J. & QUINN, B. G. (1979). The determination of the order of an autoregression. J. R. Statist. Soc. B 41, 190-5.

HOCKING, R. R. (1976). The analysis and selection of variables in linear regression. Biometrics 32, 1-49.

MALLOWS, C. L. (1973). Some comments on Cp. Technometrics 15, 661-75.

MCCLAVE, J. T. (1978). Estimating the order of autoregressive models: the max χ² method. J. Am. Statist. Assoc. 73, 122-8.

NEWTON, R. G. & SPURRELL, D. J. (1967). A development of multiple regression for the analysis of routine data. Appl. Statist. 16, 51-65.

OLIKER, V. I. (1978). On the relationship between the sample size and the number of variables in a linear regression model. Commun. Statist. A 7, 509-16.

PARK, S. H. (1977). Selection of polynomial terms for response surface experiments. Biometrics 33, 225-9.

SCHWARZ, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-4.

SHIBATA, R. (1980). Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. Ann. Statist. 8, 147-64.

SIMS, C. A. (1971). Distributed lag estimation when the parameter space is explicitly infinite-dimensional. Ann. Math. Statist. 42, 1622-36.

STONE, M. (1979). Comments on model selection criteria of Akaike and Schwarz. J. R. Statist. Soc. B 41, 276-8.

SUGIURA, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections. Commun. Statist. A 7, 13-26.

WIDDER, D. V. (1946). The Laplace Transform. Princeton: Princeton University Press.

YOSIDA, K. (1968). Functional Analysis. New York: Springer.

[Received April 1979. Revised July 1980]