# Improved Risk Bounds in Isotonic Regression

Sabyasachi Chatterjee, Adityanand Guntuboyina and Bodhisattva Sen

Yale University, University of California at Berkeley and Columbia University

November 13, 2013

Abstract

We consider the problem of estimating an unknown non-decreasing sequence $\theta$ from finitely many noisy observations. We give an improved global risk upper bound for the isotonic least squares estimator (LSE) in this problem. The obtained risk bound behaves differently depending on the form of the true sequence $\theta$: one gets a whole range of rates from $\log n/n$ (when $\theta$ is constant) to $n^{-2/3}$ (when $\theta$ is uniformly increasing in a certain sense). In particular, when $\theta$ has $k$ constant pieces then the risk bound becomes $(k/n)\log(en/k)$. As a consequence, we illustrate the adaptation properties of the LSE. We also prove an analogue of the risk bound for model misspecification, i.e., when $\theta$ is not non-decreasing. We also prove local minimax lower bounds for this problem which show that the LSE is nearly optimal in a local nonasymptotic minimax sense.

1 Introduction

We consider the problem of estimating an unknown non-decreasing regression function $f_0$ from finitely many noisy observations. Specifically, only under the assumption that $f_0$ is non-decreasing, the goal is to estimate $f_0$ from data $(x_1, Y_1), \dots, (x_n, Y_n)$ with

$$Y_i = f_0(x_i) + \epsilon_i, \quad \text{for } i = 1, 2, \dots, n, \tag{1}$$

where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. mean zero errors with variance $\sigma^2 > 0$ and $x_1 <$ … $n/2\}$ (here $I$ denotes the indicator function) and hence $V(\theta) = 1$. Then $R_Z(n; \theta)$ is essentially $(\sigma^2/n)^{2/3}$ while $R(n; \theta)$ is much smaller because it is at most $(32\sigma^2/n)\log(en/2)$ as can be seen by taking $\pi = \pi_\theta$ in (8).

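More generally, by taking $\pi = \pi_\theta$ in the infimum of the definition of $R(n; \theta)$, we obtain

$$\mathbb{E}\,\ell^2(\hat\theta, \theta) \le \frac{16\,k(\theta)\,\sigma^2}{n} \log\frac{en}{k(\theta)}, \tag{9}$$

which is a stronger bound than (5) when $k(\theta)$ is small. The reader may observe that $k(\theta)$ is small precisely when the differences $\theta_i - \theta_{i-1}$ are sparse.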
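As a rough numerical illustration of (9), the following sketch simulates the risk of the LSE for a two-piece step sequence and compares it with the right-hand side of (9). The Gaussian errors, the sample size, the number of replications, and the helper names `isotonic_pava`, `bound_9` and `simulated_risk` are all assumptions made for this example; the pool-adjacent-violators routine is a standard way to compute the LSE and is not taken from the text.

```python
import math
import random


def isotonic_pava(y):
    """Isotonic least squares fit via pool-adjacent-violators."""
    blocks = []  # each block is [sum, count]
    for value in y:
        blocks.append([value, 1])
        # Merge while the mean of the previous block exceeds the mean of the last.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit


def bound_9(n, k, sigma):
    """Right-hand side of (9): (16 k sigma^2 / n) log(e n / k)."""
    return 16.0 * k * sigma ** 2 / n * math.log(math.e * n / k)


def simulated_risk(theta, sigma, reps, seed=0):
    """Monte Carlo estimate of E l^2(theta_hat, theta), assuming Gaussian errors."""
    rng = random.Random(seed)
    n = len(theta)
    total = 0.0
    for _ in range(reps):
        y = [t + rng.gauss(0.0, sigma) for t in theta]
        fit = isotonic_pava(y)
        total += sum((f - t) ** 2 for f, t in zip(fit, theta)) / n
    return total / reps


n, sigma = 200, 1.0
theta = [0.0] * (n // 2) + [1.0] * (n // 2)  # two constant pieces, so k(theta) = 2
risk = simulated_risk(theta, sigma, reps=200)
print("simulated risk:", risk)
print("bound (9):     ", bound_9(n, 2, sigma))
```

The simulation only checks that the bound is not violated in this one configuration; it says nothing about how tight the constant 16 is.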

We prove some properties of our risk bound $R(n; \theta)$ in Section 3. In Theorem 3.1, we show that $R(n; \theta)$ is bounded from above by a multiple of $R_Z(n; \theta)$ that is at most logarithmic in $n$. Therefore, our inequality (7) is always only slightly worse off than (5) while being much better in the case of certain sequences $\theta$. We also show in Section 3 that the risk bound $R(n; \theta)$ behaves differently depending on the form of the true sequence $\theta$. This means that the bound (7) demonstrates adaptive behavior of the LSE. One gets a whole range of rates from $(\log n)/n$ (when $\theta$ is constant) to $n^{-2/3}$ up to logarithmic factors in the worst case (this worst case rate corresponds to the situation where $\min_i(\theta_i - \theta_{i-1}) \gtrsim 1/n$). The bound (7) therefore presents a bridge between the two terms in the bound (5). We suspect that such adaptation of the LSE in global estimation is rather unique to shape-constrained estimation. It has also been observed to some degree in Guntuboyina and Sen (2013) in the related convex regression problem.

In addition to being an upper bound for the risk of the LSE, we believe that the quantity $R(n; \theta)$ also acts as a benchmark for the risk of any estimator in monotone regression. By this, we mean that, in a certain sense, no estimator can have risk that is significantly better than $R(n; \theta)$. We substantiate this claim in Section 4 by proving lower bounds for the local minimax risk near the true $\theta$. For $\theta \in \mathcal{M}$, the quantity

$$R_n(\theta) := \inf_{\hat t} \sup_{t \in N(\theta)} \mathbb{E}_t\,\ell^2(\hat t, t) \quad \text{with} \quad N(\theta) := \left\{ t \in \mathcal{M} : \ell_\infty^2(t, \theta) \lesssim R(n; \theta) \right\}$$

will be called the local minimax risk at $\theta$ (see Section 4 for the rigorous definition of the neighborhood $N(\theta)$ where the multiplicative constants hidden by the $\lesssim$ sign are explicitly given). In the above display $\ell_\infty$ is defined as $\ell_\infty(t, \theta) := \max_i |t_i - \theta_i|$. The infimum here is over all possible estimators $\hat t$. $R_n(\theta)$ represents the smallest possible (supremum) risk under the knowledge that the true sequence $t$ lies in the neighborhood $N(\theta)$. It provides a measure of the difficulty of estimation of $\theta$. Note that the size of the neighborhood $N(\theta)$ changes with $\theta$ (and with $n$) and also reflects the difficulty level of the problem.


Under each of the following two setups for $\theta$, and the assumption of normality of the errors, we show that $R_n(\theta)$ is bounded from below by $R(n; \theta)$ up to multiplicative logarithmic factors of $n$. Specifically,

1. when the increments of $\theta$ (defined as $\theta_i - \theta_{i-1}$, for $i = 2, \dots, n$) grow like $1/n$, we prove in Theorem 4.3 that

$$R_n(\theta) \gtrsim \left(\frac{\sigma^2 V(\theta)}{n}\right)^{2/3} \gtrsim \frac{R(n; \theta)}{\log(4n)}; \tag{10}$$

2. when $k(\theta) = k$ and the $k$ values of $\theta$ are sufficiently well-separated, we show in Theorem 4.4 that

$$R_n(\theta) \gtrsim R(n; \theta) \left(\log\frac{en}{k}\right)^{-2/3}. \tag{11}$$

Because $R(n; \theta)$ is an upper bound for the risk of the LSE and also is a local minimax lower bound in the above sense, our results imply that the LSE is near-optimal in a local non-asymptotic minimax sense.

We also study the performance of the LSE under model misspecification when the true sequence $\theta$ is not necessarily non-decreasing. Here we prove in Theorem 5.1 that $\mathbb{E}\,\ell^2(\hat\theta, \tilde\theta) \le R(n; \tilde\theta)$ where $\tilde\theta$ denotes the non-decreasing projection of $\theta$ (see Section 5 for its definition). This should be contrasted with the risk bound of Zhang (2002) who proved that $\mathbb{E}\,\ell^2(\hat\theta, \tilde\theta) \lesssim R_Z(n; \tilde\theta)$. As before, our risk bound is at most slightly worse (by a multiplicative logarithmic factor in $n$) than $R_Z$ but is much better when $k(\tilde\theta)$ is small. We describe two situations where $k(\tilde\theta)$ is small: when $\theta$ itself has few constant blocks (see (58) and Lemma 5.4) and when $\theta$ is non-increasing (in which case $k(\tilde\theta) = 1$; see Lemma 5.3).

The paper is organized as follows: In Section 2 we state and prove our main upper bound for the risk $\mathbb{E}\,\ell^2(\hat\theta, \theta)$. We investigate the behavior of $R(n; \theta)$ for different values of the true sequence $\theta$ and compare it with $R_Z(n; \theta)$ in Section 3. Local minimax lower bounds for $R_n(\theta)$ for two different scenarios for $\theta$ are proved in Section 4. We study the performance of the LSE under model misspecification in Section 5. Section 6 gives some auxiliary results needed in the proofs of our main results.

Finally, we note that all the results in our paper are non-asymptotic in nature. Moreover, all the proofs make use of very light probabilistic apparatus and are thus accessible to a larger readership.


2 Our risk bound

In the following theorem, we present our risk bound (7); in fact, we prove a slightly stronger inequality than (7). Before stating our main result we introduce some notation. For any sequence $(a_1, a_2, \dots, a_n) \in \mathbb{R}^n$ and any $1 \le k \le l \le n$, let

$$\bar a_{k,l} := \frac{1}{l - k + 1} \sum_{j=k}^{l} a_j. \tag{12}$$

We will use this notation mainly when $a$ equals either the sequence $Y$ or $\theta$. Our proof is based on the following explicit representation of the LSE (see Chapter 1 of Robertson et al. (1988)):

$$\hat\theta_j = \min_{l \ge j} \max_{k \le j} \bar Y_{k,l}. \tag{13}$$
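The representation (13) can be evaluated directly from the block means in (12). The sketch below does so by brute force and cross-checks the result against the pool-adjacent-violators algorithm, which is a standard alternative way to compute the same estimator and is an addition of this example. The function names and the toy data are hypothetical.

```python
from itertools import accumulate


def block_mean(prefix, k, l):
    """Mean of y_k, ..., y_l (1-indexed, inclusive), as in (12), from prefix sums."""
    return (prefix[l] - prefix[k - 1]) / (l - k + 1)


def isotonic_minmax(y):
    """Isotonic LSE via the min-max formula (13); O(n^3), for illustration only."""
    n = len(y)
    prefix = [0.0] + list(accumulate(y))
    fit = []
    for j in range(1, n + 1):
        fit.append(min(
            max(block_mean(prefix, k, l) for k in range(1, j + 1))
            for l in range(j, n + 1)
        ))
    return fit


def isotonic_pava(y):
    """Isotonic LSE via pool-adjacent-violators, used here as a cross-check."""
    blocks = []  # each block is [sum, count]
    for value in y:
        blocks.append([value, 1])
        # Merge while the mean of the previous block exceeds the mean of the last.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit


y = [1.0, 3.0, 2.0, 2.0, 5.0, 4.0]
print(isotonic_minmax(y))
print(isotonic_pava(y))
```

The brute-force version is useful only for checking small cases; it mirrors the formula term by term, which is the form the proof below works with.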

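For $x \in \mathbb{R}$, we write $x_+ := \max\{0, x\}$ and $x_- := \min\{0, x\}$. For $\theta \in \mathcal{M}$ and $\pi = (n_1, \dots, n_k) \in \Pi$, let

$$D_\pi(\theta) = \left( \frac{1}{n} \sum_{i=1}^{k} \sum_{j=s_{i-1}+1}^{s_i} \left(\theta_j - \bar\theta_{s_{i-1}+1,\,s_i}\right)^2 \right)^{1/2}$$

where $s_0 = 0$ and $s_i = n_1 + \dots + n_i$ for $1 \le i \le k$. Like $V_\pi(\theta)$, this quantity $D_\pi(\theta)$ can also be treated as a measure of the variation of $\theta$ with respect to $\pi$. This measure also satisfies $D_{\pi_\theta}(\theta) = 0$ for every $\theta \in \mathcal{M}$. Moreover

$$D_\pi(\theta) \le V_\pi(\theta) \quad \text{for every } \theta \in \mathcal{M} \text{ and } \pi \in \Pi.$$

When $\pi = (n)$ is the trivial partition, $D_\pi(\theta)$ turns out to be just the standard deviation of $\theta$. In general, $D_\pi^2(\theta)$ is analogous to the within group sum of squares term in ANOVA with the blocks of $\pi$ being the groups. Below, we prove a stronger version of (7) with $D_\pi(\theta)$ replacing $V_\pi(\theta)$ in (7).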
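To make the definition concrete, the following sketch computes this within-block variation measure for a small sequence and checks the two special cases mentioned above: it vanishes for the partition given by the constant pieces of the sequence, and it equals the (population) standard deviation for the trivial partition. The function name `d_pi` and the example sequence are hypothetical.

```python
import math
import statistics


def d_pi(theta, block_sizes):
    """Square root of the averaged within-block squared deviations of theta."""
    if sum(block_sizes) != len(theta):
        raise ValueError("block sizes must sum to len(theta)")
    total, start = 0.0, 0
    for size in block_sizes:
        block = theta[start:start + size]
        mean = sum(block) / size
        total += sum((value - mean) ** 2 for value in block)
        start += size
    return math.sqrt(total / len(theta))


theta = [0.0, 0.0, 0.0, 1.0, 1.0, 2.0]

# Partition matching the constant pieces of theta: no within-block variation.
print(d_pi(theta, [3, 2, 1]))

# Trivial partition (n): the population standard deviation of theta.
print(d_pi(theta, [6]), statistics.pstdev(theta))
```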

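Theorem 2.1. For every $\theta \in \mathcal{M}$, the risk of the LSE satisfies the following inequality:

$$\mathbb{E}\,\ell^2(\hat\theta, \theta) \le 4 \inf_{\pi \in \Pi} \left( D_\pi^2(\theta) + \frac{4\sigma^2 k(\pi)}{n} \log\frac{en}{k(\pi)} \right). \tag{14}$$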
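Since (14) involves an infimum over partitions, plugging in any single partition already yields a valid upper bound. The sketch below evaluates the bracketed expression for a few equal-block partitions of a linearly increasing sequence, to show the trade-off between the variation term and the term that grows with the number of blocks. The sequence, the noise level, the restriction to equal blocks, and the function names are assumptions made for this example.

```python
import math


def d_pi_squared(theta, block_sizes):
    """Averaged within-block squared deviations of theta (the squared variation term)."""
    total, start = 0.0, 0
    for size in block_sizes:
        block = theta[start:start + size]
        mean = sum(block) / size
        total += sum((value - mean) ** 2 for value in block)
        start += size
    return total / len(theta)


def rhs_14(theta, block_sizes, sigma):
    """Value of 4 * (D^2 + (4 sigma^2 k / n) log(e n / k)) for one fixed partition."""
    n, k = len(theta), len(block_sizes)
    penalty = 4.0 * sigma ** 2 * k / n * math.log(math.e * n / k)
    return 4.0 * (d_pi_squared(theta, block_sizes) + penalty)


n, sigma = 1000, 0.1
theta = [i / n for i in range(1, n + 1)]  # linearly increasing from 1/n to 1

for k in (1, 10, 100, 1000):
    # k equal blocks of size n // k (k divides n for the values used here)
    print(k, rhs_14(theta, [n // k] * k, sigma))
```

With few blocks the variation term dominates, with many blocks the logarithmic term dominates, and an intermediate number of blocks gives the smallest value among those tried.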

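Proof. Fix $1 \le j \le n$ and $0 \le m \le n - j$. By (13), we have

$$\hat\theta_j = \min_{l \ge j} \max_{k \le j} \bar Y_{k,l} \le \max_{k \le j} \bar Y_{k,j+m} = \max_{k \le j} \left( \bar\theta_{k,j+m} + \bar\epsilon_{k,j+m} \right)$$

where, in the last equality, we used $\bar Y_{k,l} = \bar\theta_{k,l} + \bar\epsilon_{k,l}$. By the monotonicity of $\theta$, we have $\bar\theta_{k,j+m} \le \bar\theta_{j,j+m}$ for all $k \le j$. Therefore, for every $\theta \in \mathcal{M}$, we get

$$\hat\theta_j - \theta_j \le \left(\bar\theta_{j,j+m} - \theta_j\right) + \max_{k \le j} \bar\epsilon_{k,j+m}.$$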


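Taking positive parts, we have

$$\left(\hat\theta_j - \theta_j\right)_+ \le \left(\bar\theta_{j,j+m} - \theta_j\right) + \max_{k \le j} \left(\bar\epsilon_{k,j+m}\right)_+.$$

Squaring and taking expectations on both sides, we obtain

$$\mathbb{E}\left(\hat\theta_j - \theta_j\right)_+^2 \le \mathbb{E}\left( \left(\bar\theta_{j,j+m} - \theta_j\right) + \max_{k \le j} \left(\bar\epsilon_{k,j+m}\right)_+ \right)^2.$$

Using the elementary inequality $(a + b)^2 \le 2a^2 + 2b^2$ we get

$$\mathbb{E}\left(\hat\theta_j - \theta_j\right)_+^2 \le 2\left(\bar\theta_{j,j+m} - \theta_j\right)^2 + 2\,\mathbb{E}\max_{k \le j} \left(\bar\epsilon_{k,j+m}\right)_+^2.$$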

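We observe now that, for fixed integers $j$ and $m$, the process $\{\bar\epsilon_{k,j+m},\ k = 1, \dots, j\}$ is a martingale with respect to the filtration $\mathcal{F}_1, \dots, \mathcal{F}_j$ where $\mathcal{F}_i$ is the sigma-field generated by the random variables $\epsilon_1, \dots, \epsilon_{i-1}$ and $\bar\epsilon_{i,j+m}$. Therefore, by Doob's inequality for submartingales (see e.g., Theorem 5.4.3 of Durrett (2010)), we have

$$\mathbb{E}\max_{k \le j} \left(\bar\epsilon_{k,j+m}\right)_+^2 \le 4\,\mathbb{E}\left(\bar\epsilon_{j,j+m}\right)_+^2 \le 4\,\mathbb{E}\left(\bar\epsilon_{j,j+m}\right)^2 \le \frac{4\sigma^2}{m + 1}.$$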
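The last two steps of this chain use only $(x_+)^2 \le x^2$ and the fact that the errors are uncorrelated with common variance $\sigma^2$. Written out under the model assumptions (i.i.d. mean zero errors with variance $\sigma^2$), the variance computation is:

```latex
\mathbb{E}\left(\bar\epsilon_{j,j+m}\right)^2
  = \frac{1}{(m+1)^2} \sum_{i=j}^{j+m} \sum_{i'=j}^{j+m} \mathbb{E}\left(\epsilon_i \epsilon_{i'}\right)
  = \frac{1}{(m+1)^2} \sum_{i=j}^{j+m} \mathbb{E}\,\epsilon_i^2
  = \frac{\sigma^2}{m+1},
```

where the cross terms vanish because $\mathbb{E}(\epsilon_i \epsilon_{i'}) = \mathbb{E}\,\epsilon_i \, \mathbb{E}\,\epsilon_{i'} = 0$ for $i \ne i'$.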

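So using the above result we get a pointwise upper bound for the positive part of the risk

$$\mathbb{E}\left(\hat\theta_j - \theta_j\right)_+^2 \le 2\left(\bar\theta_{j,j+m} - \theta_j\right)^2 + \frac{8\sigma^2}{m + 1}. \tag{15}$$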

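Note that the above upper bound holds for any arbitrary $m$, $0 \le m \le n - j$. By a similar argument we can get the following pointwise upper bound for the negative part of the risk, which now holds for any $m$, $0 \le m \le j - 1$:

$$\mathbb{E}\left(\hat\theta_j - \theta_j\right)_-^2 \le 2\left(\theta_j - \bar\theta_{j-m,j}\right)^2 + \frac{8\sigma^2}{m + 1}. \tag{16}$$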

Let us now fix $\pi = (n_1, \dots, n_k) \in \Pi$. Let $s_0 := 0$ and $s_i := n_1 + \dots + n_i$ for $1 \le i \le k$. For each $j = 1, \dots, n$, we define two integers $m_1(j)$ and $m_2(j)$ in the following way: $m_1(j) = s_i - j$ and $m_2(j) = j - 1 - s_{i-1}$ when $s_{i-1} + 1 \le j \le s_i$. We use this choice of $m_1(j)$ in (15) and $m_2(j)$ in (16) to obtain

$$\mathbb{E}\left(\hat\theta_j - \theta_j\right)^2 \le A_j + B_j$$

where

Aj := 2(j,j+m1(j) j