
Aggregation versus Empirical Risk Minimization

Guillaume Lecué¹,²   Shahar Mendelson¹,³

December 18, 2007

Abstract

Given a finite set F of estimators, the problem of aggregation is to construct a new estimator that has a risk as close as possible to the risk of the best estimator in F. It was conjectured that empirical minimization performed in the convex hull of F is an optimal aggregation method, but we show that this conjecture is false. Despite that, we prove that empirical minimization in the convex hull of a well chosen, empirically determined, subset of F is an optimal aggregation method.

1 Introduction

In this article, we solve a problem concerning aggregation of estimators that was posed by P. Massart. To formulate the problem we need several definitions.

Let Ω be a measurable set endowed with a probability measure µ and let F be a finite class of functions on Ω. Assume that Y is an unknown random variable one wishes to estimate, where the given data is n independent observations $D = (X_i, Y_i)_{i=1}^n$ distributed according to the joint probability distribution of µ and Y, which we denote throughout this article by ν. If f is a candidate estimator of Y, the quality of estimation of Y by f is given by the risk of f:
\[ R(f) = \mathbb{E}\,\ell(f, Y), \]
where $\ell : \mathbb{R}^2 \to \mathbb{R}$ is a non-negative function, called the loss function. If $\tilde f$ is a random function determined using the data D, the quality of estimation of Y by $\tilde f$ is the conditional expectation
\[ R(\tilde f) = \mathbb{E}\big( \ell(\tilde f, Y) \mid D \big). \]

¹ Centre for Mathematics and its Applications, The Australian National University, Canberra, ACT 0200, Australia and Department of Mathematics, Technion, I.I.T, Haifa 32000, Israel. Supported in part by an Australian Research Council Discovery grant DP0559465 and by an Israel Science Foundation grant 666/06.
² Email: [email protected]
³ Email: [email protected]

Throughout this article, we will restrict ourselves to functions f and targets Y that are bounded in $L_\infty$ by b. Also, we will only consider finite classes F and denote their cardinality by M.

Given a map (or learning algorithm) A that assigns to a sample D a function $A_D \in F$ and a confidence parameter δ, the uniform error rate of A is defined as the function H(n, M, δ) for which the following holds: for every integer n, every class F of cardinality M and every target Y (all bounded by b), with $\nu^n$-probability at least 1 − δ (i.e. relative to samples of cardinality n),
\[ R(A_D) \le \min_{f \in F} R(f) + H(n, M, \delta). \]

One can show ([8]; see also [5], [9], [4]) that if $\ell(x, y) = (x - y)^2$, then for every random map A there exists a constant c, depending only on the map and on δ, such that for every n, $H(n, M, \delta) \ge c/\sqrt{n}$. In fact, the result is even stronger: this lower bound holds for every individual class F that satisfies certain conditions and for a wider class of loss functions. For example, the conditions can be verified for the squared loss if F is a finite class of functions.

The lower bound on H(n, M, δ) implies that, regardless of the estimation procedure one chooses, it is impossible to obtain error rates that converge to 0 at a rate faster than $1/\sqrt{n}$ uniformly over every F of cardinality M. Thus, to find a procedure that gives faster rates than $1/\sqrt{n}$, one has to consider maps into larger classes than F itself. This leads to the notion of aggregation (see, for example, [10], [1]).

In the aggregation framework, one is given a set F of M functions (usually selected in a preliminary stage out of a larger class as potentially good estimators of Y). The problem of aggregation is to construct a procedure that mimics the best element in F, without the restriction that A has to select a function in F. Having this in mind, one can define the optimal rate of aggregation (cf. [13]), in a similar way to the notion of the minimax rate of convergence for the estimation problem. This is the smallest price that one has to pay to mimic, in expectation, the best element in a function class F of cardinality M from n observations. Here, we focus on results which hold with high probability. We therefore consider the following definition of optimality.

Definition 1.1 A function $\psi_\delta(n, M)$ is an optimal rate of aggregation with confidence δ, and a procedure $\tilde f$ is an optimal aggregation procedure with confidence δ, if there exists an absolute constant $c_1$ for which the following hold:

• For any integers n and M, any set F of cardinality M and any target Y (all bounded by b), with $\nu^n$-probability at least 1 − δ, $\tilde f$ satisfies
\[ R(\tilde f) \le \min_{f \in F} R(f) + c_1\psi_\delta(n, M), \]
where $R(\tilde f)$ is the conditional expectation $\mathbb{E}(\ell(\tilde f, Y) \mid D)$.

• For any procedure $\bar f$ and any constant $c_2$, there exist integers M and n, a set F of cardinality M and a target Y (all bounded by b) such that, with $\nu^n$-probability at least 2δ,
\[ R(\bar f) \ge \min_{f \in F} R(f) + c_2\psi_\delta(n, M). \]

One can show (cf. [4], [13], [1]) that if the loss satisfies a slightly stronger property than convexity, then the optimal rate of aggregation is
\[ \frac{\log M}{n}, \tag{1.1} \]
which is significantly better than the $1/\sqrt{n}$ rate, the best one can obtain when A is restricted to taking values in F itself.

A natural procedure that is very useful in prediction is empirical minimization, which assigns to each sample D the function that minimizes the empirical risk
\[ R_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i) \]
on a given set.

One can show that for the type of classes we have in mind (finite, of cardinality M), empirical minimization in F gives the optimal rate among all the maps A restricted to taking values in F. Since there are optimal aggregation procedures that are convex combinations of elements in F (see, for example, [4], [1]), it is tempting to believe that empirical minimization performed in the convex hull of F, rather than in F itself, would achieve the optimal rate of aggregation (1.1). This was the question asked by P. Massart.
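As a point of reference, here is a minimal sketch of empirical minimization over a finite class with the squared loss, purely for illustration; representing F by its values on the sample is an implementation choice, not something prescribed by the paper.

```python
import numpy as np

def empirical_minimizer(F_values, Y):
    """Return the index of the function in F with the smallest empirical risk R_n.

    F_values : array of shape (M, n), F_values[j, i] = f_j(X_i)
    Y        : array of shape (n,), the observed targets Y_i
    """
    risks = ((F_values - Y[None, :]) ** 2).mean(axis=1)  # R_n(f_j) for the squared loss
    return int(np.argmin(risks)), risks
```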

Question 1.2 Is empirical minimization performed on conv(F) an optimal aggregation procedure?

The reason one would expect the answer to this question to be positive stems from the fact that the approximation error $\inf_{f \in \mathrm{conv}(F)} R(f)$ is likely to be significantly smaller than $\min_{f \in F} R(f)$, to a degree that outweighs the increased statistical error caused by performing empirical minimization on the much larger set conv(F).

Our first result is that, contrary to this intuition, the answer to Question 1.2 is negative in a very strong way.

Theorem A. There exist absolute constants $c_1$, $c_2$ and $c_3$ for which the following holds. For every integer n there is a set $F_n$ of functions of cardinality $M = c_1\sqrt{n}$ and a target Y (all bounded by 1), such that with $\nu^n$-probability at least $1 - \exp(-c_2\sqrt{n})$,
\[ R(\tilde f) \ge \min_{f \in F_n} R(f) + \frac{c_3}{\sqrt{n}}, \]
where $\tilde f$ is the empirical minimizer in $\mathrm{conv}(F_n)$ and R is measured relative to the squared loss $\ell(x, y) = (x - y)^2$.

In other words, empirical minimization performed in conv(F) does not even come close to the optimal aggregation rate, and is not far from the trivial rate that one can achieve by performing empirical minimization in F itself, which is $c(\delta)\sqrt{(\log M)/n}$.

Nevertheless, understanding why empirical minimization does not perform well on F and on conv(F) as an aggregation method does lead to an improved procedure. Our second result shows that empirical minimization on an appropriate, data-dependent subset of conv(F) achieves the optimal rate of aggregation in (1.1). To formulate our result, denote for every sample $D = (X_i, Y_i)_{i=1}^{2n}$ the subsamples $D_1 = (X_i, Y_i)_{i=1}^{n}$ and $D_2 = (X_i, Y_i)_{i=n+1}^{2n}$. For every $x > 0$, let $\alpha = ((x + \log M)/n)^{1/2}$, and for every sample $D = (X_i, Y_i)_{i=1}^{2n}$, set
\[ F_1 = \Big\{ f \in F : R_n(f) \le R_n(\hat f) + C_1 \max\big\{ \alpha\|f - \hat f\|_{L_2^n},\ \alpha^2 \big\} \Big\}, \tag{1.2} \]
where $\hat f$ is an empirical minimizer in F with respect to $D_1$, $L_2^n$ is the $L_2$ norm with respect to the random empirical measure $n^{-1}\sum_{i=1}^{n}\delta_{X_i}$, and $C_1 > 0$ is a constant depending only on ℓ and b.

Theorem B. Under mild assumptions on the loss ℓ, there exists a constant $c_1$ depending only on b and ℓ for which the following holds. Let F be a class of functions bounded by b of cardinality M and assume that Y is bounded by b. For any $x > 0$, if $\tilde f$ is the empirical minimizer in the convex hull of $F_1$ with respect to $D_2$ then, with $\nu^{2n}$-probability at least $1 - 2\exp(-x)$,
\[ R(\tilde f) \le \min_{f \in F} R(f) + c_1(x + 1)\frac{\log M}{n}. \]

The geometric motivation behind the proof of Theorem B will be explained in the following section and the proof of the theorem will appear in Section 4. We will present the proof of Theorem A in Section 5.

Finally, a word about notation. Throughout, we denote absolute constants by $c_1$, $c_2$, etc. Their values may change from line to line. Constants whose value remains fixed are denoted by $C_1$, $C_2$, etc. Given a sample $(Z_i)_{i=1}^n$, we set $P_n = n^{-1}\sum_{i=1}^n\delta_{Z_i}$, the random empirical measure supported on $Z_1, \dots, Z_n$. For any function f, let $(P_n - P)(f) = n^{-1}\sum_{i=1}^n f(Z_i) - \mathbb{E} f(Z)$, and for a class of functions F, $\|P_n - P\|_F = \sup_{f \in F}|(P_n - P)(f)|$.

2 The role of convexity in aggregation

In this section, our goal is to explain the geometric idea behind the proofs of Theorem A and Theorem B. To that end, we will restrict ourselves to the case of the squared loss $\ell(x, y) = (x - y)^2$ and a noiseless target function $T : \Omega \to \mathbb{R}$.

Set $f_F = \operatorname*{argmin}_{f \in F}\mathbb{E}\ell(f, T)$ and observe that $f_F$ minimizes the $L_2(\mu)$ distance between T and F. Recall that our aim is to construct some $\tilde f$ such that, with probability at least 1 − δ,
\[ \|\tilde f - T\|_{L_2(\mu)}^2 \le \|f_F - T\|_{L_2(\mu)}^2 + c(\delta)\Phi(n, M), \]
where n is the sample size, |F| = M and Φ is as small as possible, hopefully of the order of $n^{-1}\log M$.

The motivation to select $\tilde f$ from $C = \mathrm{conv}(F)$ is natural, since one can expect that $\min_{h \in C}\|h - T\|_{L_2(\mu)} = \|f_C - T\|_{L_2(\mu)}$ is much smaller than $\|f_F - T\|_{L_2(\mu)}$. Moreover, it is reasonable to think that empirical minimization performed in C has a relatively fast error rate, which we denote by $c_1(\delta)\Psi(n, M)$. Therefore, if $\tilde f$ is the empirical minimizer in C then
\[
\|\tilde f - T\|_{L_2(\mu)}^2 \le \|f_C - T\|_{L_2(\mu)}^2 + c_1(\delta)\Psi(n, M)
\le \|f_F - T\|_{L_2(\mu)}^2 + c_1(\delta)\Psi(n, M) - \Big( \|f_F - T\|_{L_2(\mu)}^2 - \|f_C - T\|_{L_2(\mu)}^2 \Big).
\]

The hope is that the gain in the approximation error
\[ \Big( \|f_F - T\|_{L_2(\mu)}^2 - \|f_C - T\|_{L_2(\mu)}^2 \Big) \]
is far more significant than Ψ(n, M), leading to a very fast aggregation rate.

Although this approach is tempting, it has serious flaws. First of all, it turns out that the statistical error of empirical minimization in a convex hull of M well chosen functions may be as bad as $1/\sqrt{n}$ (for $M \sim \sqrt{n}$, see Theorem 5.5). Second, it is possible to construct such a class and a target for which $\|f_F - T\|_{L_2(\mu)} = \|f_C - T\|_{L_2(\mu)}$. Thus, there is no gain in the approximation error by passing to the convex hull.

The class we shall construct will be $\{0, \pm\varphi_1, \dots, \pm\varphi_M\}$, where $(\varphi_i)_{i=1}^M$ is a specific orthonormal family on [0, 1], and the target Y is $\varphi_{M+1}$, implying that $f_F = f_C = 0$. For this choice of F and Y one can show that $\Psi(n, c_1\sqrt{n}) \ge c_2(\delta)/\sqrt{n}$ for suitable constants $c_1$ and $c_2(\delta)$.

Fortunately, not all is lost as far as using empirical minimization in a convex hull, but one has to be more careful in selecting the set in which it is performed. The key point is to identify situations in which there is a significant gain in the approximation error by passing to the convex hull.

Assume that there are at least two functions in F that almost minimize the loss R in F (which, in the squared loss case, is the same as almost minimizing the $L_2$ distance between T and F), and that these two functions are relatively "far away" in $L_2$. By the parallelogram equality (or by a uniform convexity argument for a more general loss function), if $f_1$ and $f_2$ are "almost minimizers" then
\[
\Big\| \frac{f_1 + f_2}{2} - T \Big\|_{L_2(\mu)}^2 \le \frac{1}{2}\|f_1 - T\|_{L_2(\mu)}^2 + \frac{1}{2}\|f_2 - T\|_{L_2(\mu)}^2 - \frac{1}{4}\|f_1 - f_2\|_{L_2(\mu)}^2
\approx \|f_F - T\|_{L_2(\mu)}^2 - \frac{1}{4}\|f_1 - f_2\|_{L_2(\mu)}^2.
\]

Thus, if $F_1$ is the set of all the almost minimizers in F of the distance to T and the diameter of $F_1$ is large (to be precise, larger than $c/\sqrt{n}$), the approximation error in the convex hull of $F_1$ is much smaller than in F. On the other hand, one can show that if the diameter of $F_1$ is smaller than $c/\sqrt{n}$, the empirical minimization algorithm in $\mathrm{conv}(F_1)$ has a very fast error rate (because one has very strong control over the variance of the various loss functions associated with this set). Therefore, in both cases, but for two completely different reasons, if $\tilde f$ is the empirical minimizer in the convex hull of $F_1$ then $\|\tilde f - T\|_{L_2(\mu)}^2 \le \|f_F - T\|_{L_2(\mu)}^2 + c(\delta)(\log M)/n$, with probability greater than 1 − δ.

Naturally, using $F_1$ is not realistic because it is impossible to identify the set of almost true minimizers of the risk in F. However, it turns out that one can replace $F_1$ with a set that can be determined empirically and has similar properties to $F_1$. The set defined in (1.2) satisfies that if its $L_2(\mu)$ diameter is larger than $c/\sqrt{n}$ then the gain in the approximation error of its convex hull is dramatic (compared with that of F), while if its diameter is smaller than $c/\sqrt{n}$ then empirical minimization performed in its convex hull yields a very fast error rate.
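The dichotomy above hinges on the parallelogram identity. A quick numerical check of the midpoint gain, with hypothetical vectors standing in for $f_1$, $f_2$ and T in a discretized $L_2(\mu)$ (all the specific choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1000                                   # discretize L2(mu): functions become vectors
T = rng.standard_normal(dim)                 # hypothetical target
a = rng.standard_normal(dim)                 # common approximation error of f1 and f2
d = rng.standard_normal(dim)                 # direction along which f1 and f2 are far apart

f1, f2 = T + a + d, T + a - d                # two "almost minimizers", far apart in L2
mid = (f1 + f2) / 2

sq = lambda g, h: np.mean((g - h) ** 2)      # squared L2(mu) distance
lhs = sq(mid, T)
rhs = 0.5 * sq(f1, T) + 0.5 * sq(f2, T) - 0.25 * sq(f1, f2)
print(lhs, rhs, sq(f1, T))  # lhs equals rhs (parallelogram identity); both lie below the
                            # average of sq(f1, T) and sq(f2, T) by exactly sq(f1, f2) / 4
```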

3 Preliminaries from Empirical Processes Theory

Here, we will present some of the results we need for our analysis. The first of these is Talagrand's concentration inequality for empirical processes indexed by a class of uniformly bounded functions.

Theorem 3.1 [7, 6] Let F be a class of functions defined on (Ω, µ) such that for every f ∈ F, $\|f\|_\infty \le b$ and $\mathbb{E} f = 0$. Let $X_1, \dots, X_n$ be independent random variables distributed according to µ and set $\sigma^2 = n\sup_{f \in F}\mathbb{E} f^2$. Define
\[ Z = \sup_{f \in F}\sum_{i=1}^n f(X_i) \quad \text{and} \quad \bar Z = \sup_{f \in F}\Big|\sum_{i=1}^n f(X_i)\Big|. \]
Then, for every x > 0 and every ρ > 0,
\[ \Pr\Big( Z \ge (1 + \rho)\mathbb{E} Z + \sigma\sqrt{Kx} + K(1 + \rho^{-1})bx \Big) \le e^{-x}, \]
\[ \Pr\Big( Z \le (1 - \rho)\mathbb{E} Z - \sigma\sqrt{Kx} - K(1 + \rho^{-1})bx \Big) \le e^{-x}, \]
and the same inequalities hold for $\bar Z$.

In our discussion, we will be interested in empirical processes indexed by a finite class of functions F and in the excess loss classes associated with F or with its convex hull, which is denoted by C.

Given a class G, the excess loss function associated with G and a function h is
\[ \mathcal{L}_G(h)(X, Y) = \ell(h(X), Y) - \ell(h_G(X), Y), \]
where $h_G$ minimizes $\mathbb{E}\ell(\cdot, Y)$ in G. Let $\mathcal{L}_G(F) = \{\mathcal{L}_G(f) : f \in F\}$ be the excess loss class relative to G with a base class F.

In cases where the class G is clear and G = F, we denote the excess loss class by $\mathcal{L}$ and the excess loss function of h by $\mathcal{L}_h$.

The following lemma is rather standard and we present its proof for the sake of completeness.

Lemma 3.2 There exist absolute constants $c_1$, $c_2$ and $c_3$ for which the following holds. Let F be a finite class of functions bounded by b and set
\[ d(F) = \mathrm{diam}(F, L_2(\mu)) \quad \text{and} \quad \sigma^2(F) = \sup_{f \in F}\mathbb{E} f^2. \]
Then
\[ \mathbb{E}\sup_{f \in F}\frac{1}{n}\sum_{i=1}^n f^2(X_i) \le c_1\max\Big\{ \sigma^2(F),\ b^2\frac{\log|F|}{n} \Big\}. \tag{3.1} \]
Also, assume that the target Y is also bounded by b and that the loss ℓ is a Lipschitz function on $[-b, b]^2$ with constant $\|\ell\|_{\mathrm{lip}}$. If $C = \mathrm{conv}(F)$ and $H = \mathcal{L}_C(C)$ then
\[ \mathbb{E}\|P_n - P\|_H \le c_3\|\ell\|_{\mathrm{lip}}\max\Big\{ d(F)\sqrt{\frac{\log|F|}{n}},\ b\frac{\log|F|}{n} \Big\}. \tag{3.2} \]

Proof. By the Giné–Zinn symmetrization argument [3] and the fact that a Bernoulli process is subgaussian with respect to the Euclidean metric, it is evident that for any class F,
\[ \mathbb{E}\|P_n - P\|_F \le \frac{2}{n}\,\mathbb{E}_X\mathbb{E}_\varepsilon\sup_{f \in F}\Big|\sum_{i=1}^n\varepsilon_i f(X_i)\Big| \le \frac{c_1}{\sqrt{n}}\,\mathbb{E}_X\int_0^{r}\sqrt{\log N(\varepsilon, F, L_2^n)}\,d\varepsilon, \]
where $L_2^n$ is the $L_2$ structure with respect to the random empirical measure $n^{-1}\sum_{i=1}^n\delta_{X_i}$ and
\[ r^2 = \sup_{f \in F}\frac{1}{n}\sum_{i=1}^n f^2(X_i). \]
Set $F^2 = \{f^2 : f \in F\}$ and notice that by a symmetrization argument and the contraction principle for Bernoulli processes (see, e.g., [7], Chapter 4),
\[ \mathbb{E} r^2 \le c_2\,\mathbb{E}\|P_n - P\|_{F^2} + \sup_{f \in F}\mathbb{E} f^2 \le c_3\|F\|_\infty\,\mathbb{E}\sup_{f \in F}\Big|\frac{1}{n}\sum_{i=1}^n\varepsilon_i f(X_i)\Big| + \sigma^2(F), \]
where $\|F\|_\infty = \sup_{f \in F}\|f\|_\infty$.

Now, if F is a finite class of bounded functions then, setting
\[ E = \mathbb{E}\sup_{f \in F}\Big|\frac{1}{n}\sum_{i=1}^n\varepsilon_i f(X_i)\Big| \]
and using Dudley's entropy integral, it is evident that
\[ E \le \frac{c_1}{\sqrt{n}}\sqrt{\log|F|}\,\big(\mathbb{E}_X r\big) \le \frac{c_4}{\sqrt{n}}\sqrt{\log|F|}\,\big(\|F\|_\infty E + \sigma^2(F)\big)^{1/2}. \]

Thus,
\[ E \le c_5\max\Big\{ \sigma(F)\sqrt{\frac{\log|F|}{n}},\ \|F\|_\infty\frac{\log|F|}{n} \Big\}, \]
and it follows that
\[ \mathbb{E}\sup_{f \in F}\frac{1}{n}\sum_{i=1}^n f^2(X_i) \le c_6\max\Big\{ \sigma^2(F),\ \|F\|_\infty^2\frac{\log|F|}{n} \Big\}, \tag{3.3} \]
as claimed.

Let $C = \mathrm{conv}(F)$, set $H = \mathcal{L}_C(C)$ and for each $u \in C$, put $\mathcal{L}_C(u) = \mathcal{L}_u$. Also, denote by $f_C$ the minimizer of $\mathbb{E}\ell(\cdot, Y)$ in C. Recall (e.g., [7], Chapter 4) that there exists an absolute constant $c_7$ such that for every $T \subset \mathbb{R}^n$,
\[ \mathbb{E}\sup_{t \in T}\Big|\sum_{i=1}^n\varepsilon_i t_i\Big| \le c_7\,\mathbb{E}\sup_{t \in T}\Big|\sum_{i=1}^n g_i t_i\Big|, \]
where $(g_i)_{i=1}^n$ are independent, standard Gaussian variables. Hence, for every $(X_i, Y_i)_{i=1}^n$,
\[ \mathbb{E}_\varepsilon\sup_{h \in H}\Big|\sum_{i=1}^n\varepsilon_i h(X_i, Y_i)\Big| \le c_7\,\mathbb{E}_g\sup_{h \in H}\Big|\sum_{i=1}^n g_i h(X_i, Y_i)\Big|. \]

Consider the Gaussian process $v \mapsto Z_v \equiv \sum_{i=1}^n g_i\mathcal{L}_v(X_i, Y_i)$ indexed by C. For every $v, u \in C$,
\[ \mathbb{E}|Z_u - Z_v|^2 = \sum_{i=1}^n\big(\mathcal{L}_u(X_i, Y_i) - \mathcal{L}_v(X_i, Y_i)\big)^2 \le \|\ell\|_{\mathrm{lip}}^2\sum_{i=1}^n\big((u - f_C) - (v - f_C)\big)^2(X_i) = \|\ell\|_{\mathrm{lip}}^2\,\mathbb{E}|Z'_u - Z'_v|^2, \]
where $Z'_u \equiv \sum_{i=1}^n g_i(u - f_C)(X_i)$. Therefore, by Slepian's Lemma (see, e.g., [2]), for every $(X_i, Y_i)_{i=1}^n$,
\[ \mathbb{E}_g\sup_{h \in H}\Big|\sum_{i=1}^n g_i h(X_i, Y_i)\Big| \le \|\ell\|_{\mathrm{lip}}\,\mathbb{E}_g\sup_{v \in \mathrm{conv}(F)}|Z'_v| = \|\ell\|_{\mathrm{lip}}\,\mathbb{E}_g\sup_{f \in F}|Z'_f|. \]

Applying Dudley's entropy integral and (3.1) for the class $\{f - f_C : f \in F\}$,
\[
\mathbb{E}\|P_n - P\|_H \le c_8\|\ell\|_{\mathrm{lip}}\sqrt{\frac{\log|F|}{n}}\Big(\mathbb{E}_X\sup_{f \in F}\|f - f_C\|_{L_2^n}^2\Big)^{1/2}
\le c_9\|\ell\|_{\mathrm{lip}}\sqrt{\frac{\log|F|}{n}}\max\Big\{\sup_{f \in F}\|f - f_C\|_{L_2(\mu)},\ b\sqrt{\frac{\log|F|}{n}}\Big\}
\le c_{10}\|\ell\|_{\mathrm{lip}}\sqrt{\frac{\log|F|}{n}}\max\Big\{ d(F),\ b\sqrt{\frac{\log|F|}{n}}\Big\},
\]
since, by convexity, $\sup_{f \in F}\|f - f_C\|_{L_2(\mu)} \le d(F)$.

Lemma 3.2 combined with Theorem 3.1 leads to the following corollary.

Corollary 3.3 There exists an absolute constant c for which the following holds. Let F be a finite class of functions bounded by b. For every x > 0 and any integer n, let $\alpha = \sqrt{(x + \log|F|)/n}$ and set $d(F) = \mathrm{diam}(F, L_2(\mu))$. Then, with probability at least $1 - \exp(-x)$, the following holds: if C is the convex hull of F and ℓ and $\mathcal{L}$ are as above, then for every v ∈ C,
\[ \Big| \frac{1}{n}\sum_{i=1}^n\mathcal{L}_C(v)(X_i, Y_i) - \mathbb{E}\mathcal{L}_C(v) \Big| \le c\,\|\ell\|_{\mathrm{lip}}\max\big\{ \alpha\cdot d(F),\ b\alpha^2 \big\}. \]

The final result we need follows immediately from Bernstein's inequality (see, e.g., [14]) and its proof is omitted.

Lemma 3.4 There exists an absolute constant c for which the following holds. Consider F and α as above and let
\[ \mathcal{L} = \big\{ \ell(f, Y) - \ell(f_F, Y) : f \in F \big\}, \]
where $f_F$ minimizes $\mathbb{E}\ell(\cdot, Y)$ in F. Then, with probability at least $1 - 2\exp(-x)$, for every f ∈ F, we have
\[ \Big| \frac{1}{n}\sum_{i=1}^n\mathcal{L}_f(X_i, Y_i) - \mathbb{E}\mathcal{L}_f \Big| \le c\,\|\ell\|_{\mathrm{lip}}\max\big\{ d_f\alpha,\ b\alpha^2 \big\}, \]
where, for every f ∈ F, $d_f = \|f - f_F\|_{L_2(\mu)}$. Also, with probability at least $1 - 2\exp(-x)$, we have, for every f, g ∈ F,
\[ \big| \|f - g\|_{L_2^n}^2 - \|f - g\|_{L_2(\mu)}^2 \big| \le c\max\big\{ \|f - g\|_{L_2(\mu)}b\alpha,\ b^2\alpha^2 \big\}. \]
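Lemma 3.4 is what makes the empirically defined set $F_1$ of (1.2) a faithful proxy for the set of true almost-minimizers. A small Monte Carlo sketch of its second part, under an arbitrary choice of a finite class of bounded functions on a finite domain (all the specifics below are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
b, M, n, x = 1.0, 30, 500, 1.0
domain = 10_000                                   # a finite stand-in for (Omega, mu)
F = rng.uniform(-b, b, size=(M, domain))          # F[j, t] = f_j(t), bounded by b
X = rng.integers(0, domain, size=n)               # i.i.d. sample from the uniform measure
alpha = np.sqrt((x + np.log(M)) / n)

# Compare empirical and actual squared L2 distances over all pairs (f, g) in F.
worst_ratio = 0.0
for i in range(M):
    for j in range(i + 1, M):
        true_sq = np.mean((F[i] - F[j]) ** 2)         # ||f - g||_{L2(mu)}^2
        emp_sq = np.mean((F[i, X] - F[j, X]) ** 2)    # ||f - g||_{L2^n}^2
        bound = max(np.sqrt(true_sq) * b * alpha, b ** 2 * alpha ** 2)
        worst_ratio = max(worst_ratio, abs(emp_sq - true_sq) / bound)
print(worst_ratio)   # typically a small constant, in line with the lemma
```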

4 The optimal aggregation procedure

Throughout this section, we will assume that F is a class of M functions bounded by b. We will also need certain assumptions on the loss ℓ and, to that end, recall the following definition, which originated in the notion of uniform convexity of normed spaces.

Definition 4.1 [8] We say that a function $f \mapsto \mathbb{E}\varphi(f(X))$ is uniformly convex with respect to the $L_2$ norm if the function $\delta_\varphi$, defined by
\[ \delta_\varphi(\varepsilon) = \inf_{\substack{f, g \in L_2 \\ \|f - g\|_2 \ge \varepsilon}}\Big\{ \frac{\mathbb{E}\varphi(f(X)) + \mathbb{E}\varphi(g(X))}{2} - \mathbb{E}\varphi\Big(\frac{f(X) + g(X)}{2}\Big) \Big\}, \tag{4.1} \]
is positive for every ε > 0. The function $\delta_\varphi$ is called the modulus of convexity of φ.

Assumption 4.1 Assume that ℓ is a Lipschitz function on $[-b, b]^2$ with constant $\|\ell\|_{\mathrm{lip}}$. Assume further that there exists a convex function $\varphi : \mathbb{R} \to \mathbb{R}^+$ such that, for any $f, g \in L_2(\mu)$,
\[ \mathbb{E}\ell(f(X), g(X)) = \mathbb{E}\varphi(f(X) - g(X)), \]
and that the modulus of convexity $\delta_\varphi$ of $f \mapsto \mathbb{E}\varphi(f(X))$ satisfies $\delta_\varphi(\varepsilon) \ge c_\varphi\varepsilon^2$ for every ε > 0.

For example, if $\ell(x, y) = |x - y|^p$ for 1 < p ≤ 2, then $\varphi(x) = |x|^p$ and $\delta_\varphi(\varepsilon) \ge [(p - 1)/4]\varepsilon^2$ (cf. [12]).

We will denote $\ell_f = \ell(f, Y)$ and $R(f) = \mathbb{E}\ell_f$, and if $\tilde f$ is a function of the sample D then $R(\tilde f) = \mathbb{E}(\ell_{\tilde f} \mid D)$. Finally, recall that $\alpha = ((x + \log M)/n)^{1/2}$, where x is the desired confidence.

The procedure we have in mind is as follows. We consider a sample $D = (X_i, Y_i)_{i=1}^{2n}$ and split it into two sub-samples, $D_1 = (X_i, Y_i)_{i=1}^{n}$ and $D_2 = (X_i, Y_i)_{i=n+1}^{2n}$. We use $D_1$ to define a random subset of F:
\[ F_1 = \Big\{ f \in F : R_n(f) \le R_n(\hat f) + C_1 \max\big\{ \alpha\|f - \hat f\|_{L_2^n},\ \alpha^2 \big\} \Big\}, \tag{4.2} \]
where $C_1$ is a constant, to be named later, that depends only on $\|\ell\|_{\mathrm{lip}}$ and b, $R_n(f) = n^{-1}\sum_{i=1}^n\ell(f(X_i), Y_i)$, $\hat f$ is a minimizer of the empirical risk $R_n(\cdot)$ in F, and $L_2^n$ is the $L_2$ space endowed with the random empirical measure $n^{-1}\sum_{i=1}^n\delta_{X_i}$.

Let us remark that to make the presentation of our results easier to follow, we avoided presenting the computation of explicit values of constants. Our computations showed that one can take $C_1 = 4\|\ell\|_{\mathrm{lip}}(1 + 9b)$, which, of course, is not likely to be the optimal choice of $C_1$.

The second step in the algorithm is performed using the second part of the sample D. The algorithm produces the empirical minimizer (relative to $D_2$) of ℓ in the convex hull of $F_1$. Let us denote this minimizer by $\tilde f$, that is,
\[ \tilde f = \operatorname*{argmin}_{h \in \mathrm{conv}(F_1)}\frac{1}{n}\sum_{i=n+1}^{2n}\ell(h(X_i), Y_i). \]
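Putting the two stages together, here is a hedged sketch of the whole procedure for the squared loss. The functions are represented by their values on the sample, $C_1$ is left as a user-supplied constant, and the convex-hull minimization is carried out by exponentiated gradient over the mixture weights; these are implementation choices made for illustration only, not the paper's prescription.

```python
import numpy as np

def aggregate(F_values, Y, x=1.0, C1=1.0, n_iter=5000, eta=0.05):
    """Two-stage aggregation with the squared loss (a sketch).

    F_values : (M, 2n) array, F_values[j, i] = f_j(X_i) on the full sample
    Y        : (2n,) array of targets; x is the confidence parameter entering alpha.
    """
    M, N = F_values.shape
    n = N // 2
    G1, G2 = F_values[:, :n], F_values[:, n:]        # evaluations on D1 and on D2
    Y1, Y2 = Y[:n], Y[n:]

    # Stage 1 (on D1): empirical risks, the empirical minimizer f_hat, and the set F_1 of (4.2).
    risks = ((G1 - Y1) ** 2).mean(axis=1)
    jhat = np.argmin(risks)
    alpha = np.sqrt((x + np.log(M)) / n)
    dist = np.sqrt(((G1 - G1[jhat]) ** 2).mean(axis=1))   # ||f - f_hat||_{L_2^n}
    in_F1 = risks <= risks[jhat] + C1 * np.maximum(alpha * dist, alpha ** 2)
    V = G2[in_F1]                                         # functions of F_1, evaluated on D2

    # Stage 2 (on D2): empirical minimization over conv(F_1), via exponentiated gradient
    # over the mixture weights (one of many ways to solve this convex problem).
    k = V.shape[0]
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        grad = 2.0 * V @ (w @ V - Y2) / n                 # gradient of the empirical risk
        w *= np.exp(-eta * grad)
        w /= w.sum()
    return in_F1, w            # tilde f = sum_j w_j f_j over the selected functions
```

Note that $F_1$ is decided using $D_1$ only, so $\mathrm{conv}(F_1)$ is independent of $D_2$, exactly as required in the proof of Theorem B.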

The main result of this section was formulated in Theorem B:

Theorem 4.2 For every b and $\|\ell\|_{\mathrm{lip}}$ there exists a constant $C_2$, depending only on b and $\|\ell\|_{\mathrm{lip}}$, for which the following holds. For any x > 0, every class F of M functions, any target Y (all bounded by b) and any loss ℓ satisfying Assumption 4.1, the empirical minimizer $\tilde f$ over the convex hull of $F_1$ satisfies, with $\nu^{2n}$-probability at least $1 - 2\exp(-x)$,
\[ R(\tilde f) = \mathbb{E}\big(\ell_{\tilde f}\mid(X_i, Y_i)_{i=1}^{2n}\big) \le \min_{f \in F} R(f) + C_2(1 + x)\frac{\log M}{n}. \]

Remark 4.3 Note that the definition of the set $F_1$, and thus the algorithm, depends on the confidence x one is interested in through the factor α. Thus $\tilde f$ also depends on the confidence.

As we mentioned in the introduction, our idea is based on constructing a set of "almost minimizers" in F, that is, functions whose "distance" from the target (as measured by R) is almost optimal. Then, one has to consider two possibilities: if the diameter of that set is small, the empirical minimization algorithm will perform very well on its convex hull, giving us the fast error rate we hope for. On the other hand, if the diameter of that set is large, there will be a major gain in the approximation error by considering functions in the convex hull. We will show that the set $F_1$ is an empirical version of the set we would have liked to have.

Lemma 4.4 There exists an absolute constant c for which the following holds. Let F, x, α and $F_1$ be defined as above. Then, with $\nu^n$-probability at least $1 - 2\exp(-x)$, the best element $f_F$ in the class F belongs to $F_1$, and any function f in $F_1$ satisfies
\[ R(f) \le R(f_F) + c\max\big\{ \alpha d(F_1),\ (1 + b)\alpha^2 \big\}, \]
where $d(F_1) = \mathrm{diam}(F_1, L_2(\mu))$.

Proof. Let $\mathcal{L}_f$ be the excess loss function associated with f (relative to F) and recall that $f_F$ minimizes $\mathbb{E}\ell(\cdot, Y)$ in F. By the second part of Lemma 3.4, with probability at least $1 - \exp(-x)$, every $f, g \in F$ satisfy
\[ \big| \|f - g\|_{L_2^n}^2 - \|f - g\|_{L_2(\mu)}^2 \big| \le c_1\max\big\{ \|f - g\|_{L_2(\mu)}b\alpha,\ b^2\alpha^2 \big\}. \]
Hence, with that probability, for every $f, g \in F$, we have
\[ \|f - g\|_{L_2(\mu)}^2 \le c_2\max\big\{ \alpha^2 b^2,\ \|f - g\|_{L_2^n}^2 \big\}, \tag{4.3} \]
and
\[ \|f - g\|_{L_2^n}^2 \le c_3\max\big\{ \alpha^2 b^2,\ \|f - g\|_{L_2(\mu)}^2 \big\}. \]

Now, by the first part of Lemma 3.4, with probability at least $1 - \exp(-x)$, any function f in F satisfies
\[ \Big| \frac{1}{n}\sum_{i=1}^n\mathcal{L}_f(X_i, Y_i) - \mathbb{E}\mathcal{L}_f \Big| \le c_4\|\ell\|_{\mathrm{lip}}\max\big\{ d_f\alpha,\ b\alpha^2 \big\}, \tag{4.4} \]
where, for every f ∈ F, $d_f^2 = \|f - f_F\|_{L_2(\mu)}^2$. Using (4.3), we have, with probability at least $1 - \exp(-x)$,
\[ d_f^2 \le c_2\max\big\{ \alpha^2 b^2,\ \|f - f_F\|_{L_2^n}^2 \big\}, \]
implying that, with probability greater than $1 - 2\exp(-x)$, every f ∈ F satisfies
\[ \Big| \frac{1}{n}\sum_{i=1}^n\mathcal{L}_f(X_i, Y_i) - \mathbb{E}\mathcal{L}_f \Big| \le c_5\|\ell\|_{\mathrm{lip}}\max\big\{ \alpha\|f - f_F\|_{L_2^n},\ \alpha^2 b \big\}. \]

Recall that for every f ∈ F, $\mathbb{E}\mathcal{L}_f \ge 0$, and thus, with probability at least $1 - 2\exp(-x)$,
\[
P_n\ell_{f_F} = P_n\ell_f - P_n\mathcal{L}_f \le P_n\ell_f - \mathbb{E}\mathcal{L}_f + \big| P_n\mathcal{L}_f - \mathbb{E}\mathcal{L}_f \big| \le P_n\ell_f + c_5\|\ell\|_{\mathrm{lip}}\max\big\{ \alpha\|f - f_F\|_{L_2^n},\ \alpha^2 b \big\}.
\]
This implies that, for a constant $C_1$ chosen properly, $f_F$ belongs to $F_1$ with probability greater than $1 - 2\exp(-x)$.

Next, let d denote the $L_2$ diameter $\mathrm{diam}(F_1, L_2(\mu))$. Since, by the first part, $f_F \in F_1$ with high probability, on that event, for every $f \in F_1$, $d_f \le d$. Note that for every f ∈ F and any sample $(X_i, Y_i)_{i=1}^n$,
\[ R(f) = R(f_F) + (P - P_n)(\mathcal{L}_f) + (P_n\ell_f - P_n\ell_{f_F}) \le R(f_F) + (P - P_n)(\mathcal{L}_f) + (R_n(f) - R_n(\hat f)). \]
Thus, by the definition of $F_1$ and the uniform estimates on $|(P_n - P)(\mathcal{L}_f)|$ in (4.4), it is evident that with probability greater than $1 - 2\exp(-x)$, every function f in $F_1$ satisfies
\[ R(f) \le R(f_F) + c_4\|\ell\|_{\mathrm{lip}}\max\big\{ d_f\alpha,\ b\alpha^2 \big\} + C_1\max\big\{ \|f - \hat f\|_{L_2^n}\alpha,\ \alpha^2 \big\}. \]

To complete the proof, observe that $\hat f \in F_1$, and thus, on this event,
\[ \|f - \hat f\|_{L_2^n}^2 \le c_3\max\big\{ \alpha^2 b^2,\ \|f - \hat f\|_{L_2(\mu)}^2 \big\} \le c_3\max\big\{ \alpha^2 b^2,\ d^2 \big\}. \]

Now we may turn to the second part of the algorithm: empirical minimization with respect to $D_2$ on the convex hull of $F_1$ (which is, of course, independent of $D_2$).

Proof of Theorem 4.2. Fix x > 0 and let $\mathcal{C}_1$ denote the convex hull $\mathrm{conv}(F_1)$. By Lemma 4.4, we may assume that $f_F \in F_1$, and set $d = \mathrm{diam}(F_1, L_2(\mu))$. Since $f_F \in F_1$,
\[ \max_{f \in F_1}\|f - f_F\|_{L_2(\mu)} \le d \le 2\max_{f \in F_1}\|f - f_F\|_{L_2(\mu)}, \tag{4.5} \]
and denote by $f_1$ a function in $F_1$ that maximizes $\|f - f_F\|_{L_2(\mu)}$ in $F_1$.

Consider the second half of the sample $D_2 = (X_i, Y_i)_{i=n+1}^{2n}$. On one hand, by Corollary 3.3, with probability at least $1 - \exp(-x)$ (relative to $D_2$), for every $v \in \mathcal{C}_1$,
\[ \Big| \frac{1}{n}\sum_{i=n+1}^{2n}\mathcal{L}_{\mathcal{C}_1}(v)(X_i, Y_i) - \mathbb{E}\mathcal{L}_{\mathcal{C}_1}(v) \Big| \le c_1\|\ell\|_{\mathrm{lip}}\max\big\{ d\alpha,\ b\alpha^2 \big\}, \]
where $\mathcal{L}_{\mathcal{C}_1}(v) = \ell(v, Y) - \ell(f_{\mathcal{C}_1}, Y)$ is the excess loss function relative to $\mathcal{C}_1$ and $f_{\mathcal{C}_1}$ is the minimizer of $\mathbb{E}\ell(\cdot, Y)$ in $\mathcal{C}_1$. Since $\tilde f$ minimizes the empirical risk in $\mathcal{C}_1$ on $D_2$, then $\frac{1}{n}\sum_{i=n+1}^{2n}\mathcal{L}_{\mathcal{C}_1}(\tilde f)(X_i, Y_i) \le 0$. Therefore,
\[
R(\tilde f) \le R(f_{\mathcal{C}_1}) + \mathbb{E}\mathcal{L}_{\mathcal{C}_1}(\tilde f) - \frac{1}{n}\sum_{i=n+1}^{2n}\mathcal{L}_{\mathcal{C}_1}(\tilde f)(X_i, Y_i) \tag{4.6}
\]
\[
\le R(f_{\mathcal{C}_1}) + c_1\|\ell\|_{\mathrm{lip}}\max\big\{ d\alpha,\ b\alpha^2 \big\}
= R(f_F) + \Big( c_1\|\ell\|_{\mathrm{lip}}\max\big\{ d\alpha,\ b\alpha^2 \big\} - \big( R(f_F) - R(f_{\mathcal{C}_1}) \big) \Big) \equiv R(f_F) + \beta,
\]
and it remains to show that $\beta \le c(x)\frac{\log M}{n}$.

To that end, we shall bound $R(f_F) - R(f_{\mathcal{C}_1})$ using the convexity properties of ℓ (Assumption 4.1). Indeed, recall that $f_F \in F_1$ (with high probability w.r.t. $D_1$) and that $f_1 \in F_1$ maximizes the $L_2(\mu)$ distance to $f_F$ in $F_1$. Consider the mid-point $f_2 \equiv (f_1 + f_F)/2 \in \mathcal{C}_1$. By our convexity assumption on the loss, any two functions u and v satisfy
\[ \mathbb{E}\varphi\Big(\frac{u + v}{2}\Big) \le \frac{1}{2}\mathbb{E}\varphi(u) + \frac{1}{2}\mathbb{E}\varphi(v) - \delta_\varphi(\|u - v\|_{L_2(\mu)}). \]
In particular, for $u = f_F - Y$ and $v = f_1 - Y$, the mid-point is $(u + v)/2 = f_2 - Y$. Hence, using the assumption on $\delta_\varphi$,
\[
R(f_2) = \mathbb{E}\ell(f_2, Y) = \mathbb{E}\varphi\Big(\frac{f_1 + f_F}{2} - Y\Big)
\le \frac{1}{2}\mathbb{E}\varphi(f_1 - Y) + \frac{1}{2}\mathbb{E}\varphi(f_F - Y) - \delta_\varphi(\|f_1 - f_F\|_{L_2})
\le \frac{1}{2}R(f_F) + \frac{1}{2}R(f_1) - c_\varphi\frac{d^2}{4}.
\]
By Lemma 4.4, the function $f_1 \in F_1$ satisfies
\[ R(f_1) \le R(f_F) + c_2\max\big\{ \alpha d,\ (1 + b)\alpha^2 \big\}, \]
implying that
\[ R(f_{\mathcal{C}_1}) \le R(f_2) \le R(f_F) + c_3\max\big\{ \alpha d,\ (1 + b)\alpha^2 \big\} - c_4 d^2. \]
Thus,
\[ \beta = c_1\|\ell\|_{\mathrm{lip}}\max\big\{ d\alpha,\ b\alpha^2 \big\} - \big( R(f_F) - R(f_{\mathcal{C}_1}) \big) \le c_5\|\ell\|_{\mathrm{lip}}\max\big\{ \alpha d,\ (1 + b)\alpha^2 \big\} - c_4 d^2. \]
It is clear that if $d \ge (c_6\|\ell\|_{\mathrm{lip}} + b)\alpha$ then $\beta \le 0$; otherwise, $\beta \le c_7(\|\ell\|_{\mathrm{lip}} + b)\alpha^2$.

5 The lower bound

Here, we will present an example showing that empirical minimization over the convex hull is very far from being an optimal aggregation method. For every integer n, we will construct a function class $F_n$ with $M = c_1\sqrt{n}$ functions for which, with probability greater than $1 - \exp(-c_2\sqrt{n})$, the empirical minimizer $\tilde f$ in $C = \mathrm{conv}(F_n)$ satisfies
\[ R(\tilde f) \ge R(f_{F_n}) + \frac{c_3}{\sqrt{n}}, \]

where $c_1$, $c_2$ and $c_3$ are absolute constants.

Let Ω = [0, 1], endowed with the Lebesgue measure µ, and let $L_2$ be the corresponding $L_2$ space. Let $(\varphi_i)_{i=1}^\infty$ be a realization of independent, symmetric, $\{-1, 1\}$-valued random variables as functions on [0, 1] (for example, $(\varphi_i)_{i=1}^\infty$ are the Rademacher functions). In particular, $(\varphi_i)_{i=1}^\infty$ is an orthonormal family in $L_2$ consisting of functions bounded by 1. Moreover, the functions $(\varphi_i)_{i=1}^\infty$ are independent and have mean zero.

Let M be an integer to be specified later and put $\ell(x, y) = (x - y)^2$. Consider
\[ F = \{0, \pm\varphi_1, \dots, \pm\varphi_M\} \]
and let $Y = \varphi_{M+1}$, which is a noiseless target function. A sample is $(X_i, Y_i)_{i=1}^n = (X_i, \varphi_{M+1}(X_i))_{i=1}^n$, where the $X_i$'s are selected independently according to µ. It is standard to verify that
\[ C = \mathrm{conv}(F) = \Big\{ \sum_{j=1}^M\lambda_j\varphi_j : \sum_{j=1}^M|\lambda_j| \le 1 \Big\}, \]
and that the true minimizers satisfy $f_F = f_C = 0$; in particular, $R(f_F) = R(f_C)$ and there is no gain in the approximation error by considering functions in the convex hull C. Also, the excess loss function of a function f, relative to F and to C, satisfies
\[ \mathcal{L}_f = (f - \varphi_{M+1})^2 - (0 - \varphi_{M+1})^2 = f^2 - 2f\varphi_{M+1}. \]

Let $\Phi(x) = (\varphi_i(x))_{i=1}^M$ and let $\langle\cdot, \cdot\rangle$ be the standard inner product in $\ell_2^M = (\mathbb{R}^M, \|\cdot\|)$. Observe that Φ is a vector with independent $\{-1, 1\}$ entries, and thus, for every $\lambda \in \mathbb{R}^M$, $\mathbb{E}\langle\lambda, \Phi\rangle^2 = \|\lambda\|^2$.

Let $\lambda \in \mathbb{R}^M$. If we set $f_\lambda = \langle\lambda, \Phi\rangle$ then, since $f_\lambda$ and $\varphi_{M+1}$ are independent and $\mathbb{E}\varphi_{M+1} = 0$, the excess risk of $f_\lambda$ satisfies
\[ \mathbb{E}\mathcal{L}_{f_\lambda} = \mathbb{E} f_\lambda^2 = \|\lambda\|^2. \]
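Anticipating Theorem 5.5 below, the construction is easy to simulate: realize the $\varphi_j(X_i)$ as i.i.d. signs, take $M \sim \sqrt{n}$, and approximate the empirical minimizer over C by exponentiated gradient on the $2M + 1$ vertices $\{0, \pm\varphi_j\}$ of the convex hull. The optimizer and its settings are arbitrary choices and it only approximates the exact minimizer, so the output is indicative rather than exact.

```python
import numpy as np

rng = np.random.default_rng(0)

def erm_excess_risk(n, n_iter=20000, eta=0.05):
    """Approximate the empirical minimizer over conv(F) for F = {0, ±phi_1, ..., ±phi_M},
    Y = phi_{M+1}, with M ~ sqrt(n); return its true excess risk E L = ||lambda||^2."""
    M = max(2, int(np.sqrt(n)))
    Phi = rng.choice([-1.0, 1.0], size=(M + 1, n))       # phi_1,...,phi_{M+1} at X_1,...,X_n
    target = Phi[M]                                      # Y_i = phi_{M+1}(X_i)

    # Vertices of conv(F): the zero function, +phi_j and -phi_j.
    V = np.vstack([np.zeros((1, n)), Phi[:M], -Phi[:M]])  # shape (2M + 1, n)

    # Empirical risk minimization over the simplex of vertex weights (exponentiated gradient).
    w = np.full(2 * M + 1, 1.0 / (2 * M + 1))
    for _ in range(n_iter):
        grad = 2.0 * V @ (w @ V - target) / n             # gradient of the empirical risk
        w *= np.exp(-eta * grad)
        w /= w.sum()

    lam = w[1:M + 1] - w[M + 1:]                # coefficients on (phi_j); ||lam||_1 <= 1
    return np.sum(lam ** 2)                     # excess risk, since E L_{f_lam} = ||lam||^2

for n in [100, 400, 1600, 6400]:
    print(n, erm_excess_risk(n), 1.0 / np.sqrt(n))
```

The printed excess risk $\|\lambda\|^2$ stays of the order of $1/\sqrt{n}$ rather than $(\log M)/n$, which is the content of Theorem A.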

A significant part of our analysis is based on concentration properties of sums of random variables that belong to an Orlicz space.

Definition 5.1 For any α ≥ 1 and any random variable f, the $\psi_\alpha$ norm of f is
\[ \|f\|_{\psi_\alpha} = \inf\big\{ C > 0 : \mathbb{E}\exp(|f|^\alpha/C^\alpha) \le 2 \big\}. \]

The $\psi_\alpha$ norm measures the tail behavior of a random variable. Indeed, one can show (see, for example, [14]) that for every u ≥ 1,
\[ \Pr(|f| > u) \le 2\exp(-cu^\alpha/\|f\|_{\psi_\alpha}^\alpha), \]
where c is an absolute constant, independent of f.

The following lemma is a $\psi_1$ version of Bernstein's inequality (see, for instance, [14]).

Lemma 5.2 Let $Y, Y_1, \dots, Y_n$ be i.i.d. random variables with $\|Y\|_{\psi_1} < \infty$. Then, for any u > 0,
\[ \Pr\Big( \Big| \frac{1}{n}\sum_{i=1}^n Y_i - \mathbb{E} Y \Big| > u\|Y\|_{\psi_1} \Big) \le 2\exp\big( -C_3 n\min(u^2, u) \big), \tag{5.1} \]
where $C_3 > 0$ is an absolute constant.

In the next lemma, we present simple $\psi_1$ estimates for $f_\lambda^2$ and the resulting deviation inequalities obtained using Lemma 5.2.

Lemma 5.3 There is an absolute constant $C_4$ for which the following holds. For every $\lambda \in \mathbb{R}^M$, $\|f_\lambda^2\|_{\psi_1} \le C_4\|\lambda\|^2$, and for every u > 0,
\[ \Pr\Big( \Big| \frac{1}{n}\sum_{i=1}^n f_\lambda^2(X_i) - \mathbb{E} f_\lambda^2 \Big| \ge uC_4\|\lambda\|^2 \Big) \le 2\exp\big(-C_3 n\min\{u^2, u\}\big). \tag{5.2} \]

Proof. Fix $\lambda \in \mathbb{R}^M$. Note that, by Hoeffding's inequality (see, for example, [14]) and the fact that $(\varphi_i)_{i=1}^M$ are independent and symmetric Bernoulli variables, we have for every u > 0,
\[ \Pr\Big( \Big| \sum_{j=1}^M\lambda_j\varphi_j \Big| > u\|\lambda\| \Big) \le 2\exp(-u^2/2). \]
Hence, $\|\langle\lambda, \Phi\rangle\|_{\psi_2} \le c_1\|\lambda\|$ for some absolute constant $c_1$. The first part of the lemma follows, since
\[ \|f_\lambda^2\|_{\psi_1} = \|\langle\lambda, \Phi\rangle^2\|_{\psi_1} = \|\langle\lambda, \Phi\rangle\|_{\psi_2}^2 \le c_1^2\|\lambda\|^2. \]
The second part of the claim follows from the first one and Lemma 5.2.

Lemma 5.3 allows us to control the deviation of the empirical $L_2^n$ norm from the actual $L_2$ norm for a large number of functions in a subset of $\{f_\lambda : \lambda \in S^{M-1}\}$. The subset we will be interested in is a maximal ε-separated subset of $S^{M-1}$ for the right choice of ε < 1.

Lemma 5.4 There exist absolute constants $C_5$, $C_6$, $C_7$ and $C_8$ for which the following holds. For any $n \ge C_5 M$, with $\mu^n$-probability at least $1 - 2\exp(-C_6 n)$, for any $\lambda \in \mathbb{R}^M$,
\[ \frac{1}{2}\|\lambda\|^2 \le \frac{1}{n}\sum_{i=1}^n f_\lambda^2(X_i) \le \frac{3}{2}\|\lambda\|^2. \]
Also, for every r > 0, we have, with $\mu^n$-probability at least $1 - 6\exp(-C_6 M)$,
\[ C_7\sqrt{\frac{rM}{n}} \le \sup_{\lambda : \|\lambda\| \le \sqrt{r}}\frac{1}{n}\sum_{i=1}^n f_\lambda(X_i)\varphi_{M+1}(X_i) \le C_8\sqrt{\frac{rM}{n}}. \]
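Before the proof, a quick numerical check of the second estimate; the quantity inside the supremum is linear in λ, so the supremum over the Euclidean ball of radius $\sqrt{r}$ can be computed exactly (the values of n and r below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 4000, 0.05
M = int(np.sqrt(n))
Phi = rng.choice([-1.0, 1.0], size=(M, n))      # phi_j(X_i), realized as i.i.d. signs
phi_next = rng.choice([-1.0, 1.0], size=n)      # phi_{M+1}(X_i), independent of the others

# (1/n) sum_i f_lam(X_i) phi_{M+1}(X_i) = <lam, v>, with v_j = (1/n) sum_i phi_j(X_i) phi_{M+1}(X_i),
# so the supremum over ||lam|| <= sqrt(r) equals sqrt(r) * ||v||.
v = Phi @ phi_next / n
print(np.sqrt(r) * np.linalg.norm(v), np.sqrt(r * M / n))   # same order of magnitude
```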

Proof. The proof of the first part is standard and we sketch it for the sake of completeness. Since $f_\lambda = \langle\lambda, \Phi\rangle$, what we wish to prove is that, with high probability,
\[ \sup_{\lambda \in S^{M-1}}\Big| \frac{1}{n}\sum_{i=1}^n\langle\lambda, \Phi(X_i)\rangle^2 - 1 \Big| \le \frac{1}{2}, \]
where $S^{M-1}$ is the unit sphere in $\ell_2^M$.

By a standard successive approximation argument (see, for example, [11]), it is enough to prove that any point x in a maximal ε-separated subset $N_\varepsilon$ of $S^{M-1}$ (for an appropriate choice of ε) satisfies
\[ \Big| \frac{1}{n}\sum_{i=1}^n\langle x, \Phi(X_i)\rangle^2 - 1 \Big| \le \delta \]
(where ε and δ depend only on the constant 1/2).

A volumetric estimate [11] shows that the cardinality of $N_\varepsilon$ is at most $(5/\varepsilon)^M$. Hence, if we take $u = \delta/C_4$ in (5.2), then
\[ \Pr\Big( \exists x \in N_\varepsilon : \Big| \frac{1}{n}\sum_{i=1}^n\langle x, \Phi(X_i)\rangle^2 - 1 \Big| \ge \delta \Big) \le \Big(\frac{5}{\varepsilon}\Big)^M\cdot 2\exp(-C_3 n\delta^2/C_4^2) \le 2\exp(-c_0 n), \]
as long as $n \ge c_1(\varepsilon, \delta)M$.

Turning to the second part, since Φ is a vector of independent, symmetric Bernoulli variables and $\varphi_{M+1}$ is also a symmetric Bernoulli variable, independent of the others, the supremum $\sup_{\lambda:\|\lambda\|\le\sqrt{r}}\sum_{i=1}^n f_\lambda(X_i)\varphi_{M+1}(X_i)$ has the same distribution as
\[ \sup_{\lambda : \|\lambda\| \le \sqrt{r}}\sum_{i=1}^n\varepsilon_i\langle\lambda, W_i\rangle \equiv (*), \]
where $(\varepsilon_i)_{i=1}^n$ are symmetric Bernoulli variables that are independent of $(W_i)_{i=1}^n$, which are independent random vertices of $\{-1, 1\}^M$. Clearly, for every $1 \le i \le n$, $\|W_i\|^2 = M$, and by the Kahane–Khintchine inequality (see, e.g., [7]),
\[ \mathbb{E}_\varepsilon(*) = \sqrt{r}\,\mathbb{E}_\varepsilon\Big\|\sum_{i=1}^n\varepsilon_i W_i\Big\| \ge c_2\sqrt{r}\Big(\sum_{i=1}^n\|W_i\|^2\Big)^{1/2} = c_2\sqrt{rnM}. \]
Also,
\[ \mathbb{E}_\varepsilon(*) \le \big(\mathbb{E}_\varepsilon(*)^2\big)^{1/2} \le \sqrt{rnM}. \]

To obtain the high probability estimate, we use the concentration result for vector-valued Rademacher processes (see, for example, [7], Chapter 4). Consider the $\ell_2^M$-valued variable $Z = \sum_{i=1}^n\varepsilon_i\Phi(X_i)$ and $Z'$, which is Z conditioned on $X_1, \dots, X_n$. By the first part of our claim, for any $n \ge c_1 M$ there is a set A with probability at least $1 - 2\exp(-c_0 n)$ on which, for every $\lambda \in \mathbb{R}^M$, $\sum_{i=1}^n\langle\lambda, \Phi(X_i)\rangle^2 \le (3/2)n\|\lambda\|^2$. Thus, on the set A,
\[ \sigma^2(Z') \equiv \sup_{\theta \in S^{M-1}}\mathbb{E}_\varepsilon\langle Z', \theta\rangle^2 = \sup_{\theta \in S^{M-1}}\sum_{i=1}^n\langle\theta, \Phi(X_i)\rangle^2 \le \frac{3}{2}n, \]
implying that for any u > 0,
\[ \Pr\big( \big|\|Z'\| - \mathbb{E}_\varepsilon\|Z'\|\big| \ge u\sqrt{n} \big) \le 4\exp(-c_3 u^2), \]

where $c_3$ is an absolute constant. Since $n \ge c_1 M$ and $\mathbb{E}_\varepsilon(*) = \sqrt{r}\,\mathbb{E}_\varepsilon\|Z'\|$, then taking $u = c_2\sqrt{M}/2$, it follows that there is an absolute constant $c_4$ for which, with probability at least $1 - 4\exp(-c_4 M)$,
\[ \frac{c_2}{2}\sqrt{\frac{rM}{n}} \le \sup_{\lambda:\|\lambda\|\le\sqrt{r}}\frac{1}{n}\sum_{i=1}^n\varepsilon_i\langle\lambda, \Phi(X_i)\rangle \le 2\sqrt{\frac{rM}{n}}. \]
Therefore, combining the two high probability estimates, and since
\[ \{ f : \mathbb{E}\mathcal{L}_f \le r \} = \{ f_\lambda : \|\lambda\| \le \sqrt{r} \}, \]
it is evident that, with probability greater than $1 - 6\exp(-c_5 M)$,
\[ \frac{c_2}{2}\sqrt{\frac{rM}{n}} \le \sup_{\lambda:\|\lambda\|\le\sqrt{r}}\frac{1}{n}\sum_{i=1}^n(f_\lambda\varphi_{M+1})(X_i) \le 2\sqrt{\frac{rM}{n}}. \]

Now we can formulate and prove the main result of this section, which will complete the proof of Theorem A.

Theorem 5.5 There exist absolute constants $c_1$, $c_2$ and $c_3$ for which the following holds. Take F and C defined as above and M such that $n = c_1 M^2$. The empirical minimizer $\tilde f$ in C satisfies, with probability at least $1 - 8\exp(-c_2\sqrt{n})$,
\[ \mathbb{E}\mathcal{L}_{\tilde f} \ge \frac{c_3}{\sqrt{n}}. \]
In particular, for that choice of M and n, with that probability, empirical minimization performed in C satisfies
\[ R(\tilde f) \ge \min_{f \in F} R(f) + \frac{c_3}{\sqrt{n}}. \]

Proof. Fix $f_\lambda = \sum_{j=1}^M\lambda_j\varphi_j \in C = \mathrm{conv}(F)$ and recall that $S^{M-1}$ is the unit sphere in $\ell_2^M$. Note that $\mathbb{E}\mathcal{L}_{f_\lambda} = \sum_{i=1}^M\lambda_i^2$, and thus
\[ \mathcal{L}_r \equiv \{ \mathcal{L}_f : f \in C,\ \mathbb{E}\mathcal{L}_f = r \} = \{ \mathcal{L}_{f_\lambda} : \lambda \in B_1^M\cap\sqrt{r}S^{M-1} \} = \{ \mathcal{L}_{f_\lambda} : \lambda \in \sqrt{r}S^{M-1} \}, \]
provided that $r \le 1/M$. Since $\mathcal{L}_f = f^2 - 2f\varphi_{M+1}$, then for every $r \le 1/M$,
\[
\inf_{\mathcal{L}_f \in \mathcal{L}_r} P_n\mathcal{L}_f = r - \sup_{\lambda \in \sqrt{r}S^{M-1}}(P - P_n)\mathcal{L}_{f_\lambda}
= r - \sup_{\lambda \in \sqrt{r}S^{M-1}}\Big( \Big( \mathbb{E} f_\lambda^2 - \frac{1}{n}\sum_{i=1}^n f_\lambda^2(X_i) \Big) - \frac{1}{n}\sum_{i=1}^n(-2f_\lambda\varphi_{M+1})(X_i) \Big)
\]
\[
\le r + \sup_{\lambda \in \sqrt{r}S^{M-1}}\Big| \mathbb{E} f_\lambda^2 - \frac{1}{n}\sum_{i=1}^n f_\lambda^2(X_i) \Big| - 2\sup_{\lambda \in \sqrt{r}S^{M-1}}\frac{1}{n}\sum_{i=1}^n(f_\lambda\varphi_{M+1})(X_i).
\]

Applying both parts of Lemma 5.4, it follows that if $n \ge C_5 M$ then, with probability at least $1 - 8\exp(-C_6 M)$,
\[ \inf_{\mathcal{L}_f \in \mathcal{L}_r} P_n\mathcal{L}_f \le \frac{3}{2}r - 2C_7\sqrt{\frac{rM}{n}} = \sqrt{r}\Big( \frac{3}{2}\sqrt{r} - 2C_7\sqrt{\frac{M}{n}} \Big). \]
Consider $n = c_1 M^2$ and note that the condition $n \ge C_5 M$ is satisfied. Hence, $\sqrt{M/n} = 1/\sqrt{c_1 M}$, and there are absolute constants $c_2 < 1$ and $c_3$ such that, for $r \le c_2/M$, $\inf_{\mathcal{L}_f \in \mathcal{L}_r} P_n\mathcal{L}_f \le -c_3\sqrt{r/M}$. Thus, we fix $r = c_2/M = (c_2 c_1^{1/2})/\sqrt{n}$.

On the other hand, combining the upper bounds from Lemma 5.4, it follows that for every ρ > 0, with probability at least $1 - 8\exp(-C_6 M)$,
\[
\sup_{\lambda:\|\lambda\|\le\sqrt{\rho}}\Big| \frac{1}{n}\sum_{i=1}^n\mathcal{L}_{f_\lambda}(X_i) - \mathbb{E}\mathcal{L}_{f_\lambda} \Big|
\le \sup_{\lambda:\|\lambda\|\le\sqrt{\rho}}\Big| \frac{1}{n}\sum_{i=1}^n f_\lambda^2(X_i) - \mathbb{E} f_\lambda^2 \Big| + 2\sup_{\lambda:\|\lambda\|\le\sqrt{\rho}}\Big| \frac{1}{n}\sum_{i=1}^n(f_\lambda\varphi_{M+1})(X_i) \Big|
\le \frac{\rho}{2} + 2C_8\sqrt{\frac{\rho M}{n}} = \frac{\rho}{2} + c_4\sqrt{\frac{\rho}{M}}.
\]

Therefore, on that set, and for $\rho \le c_5/M$,
\[ \inf_{\lambda:\|\lambda\|\le\sqrt{\rho}} P_n\mathcal{L}_{f_\lambda} \ge -\sup_{\lambda:\|\lambda\|\le\sqrt{\rho}}|(P_n - P)(\mathcal{L}_{f_\lambda})| \ge -c_6\sqrt{\frac{\rho}{M}}. \]

Hence, with probability at least $1 - 8\exp(-C_6 M)$, as long as $c_6^2\rho \le c_3^2 r/2$, $\operatorname*{argmin}_{f \in C}P_n\mathcal{L}_f$ is a function $f_\lambda$ indexed by a λ of norm larger than $\sqrt{\rho}$, and hence with an excess risk greater than ρ. Therefore, taking $\rho \sim r$ and noting that $r \sim 1/\sqrt{n}$, there exists an absolute constant $c_7 > 0$ such that, with that probability,
\[ \mathbb{E}\big(\mathcal{L}_{\tilde f} \mid (X_i)_{i=1}^n\big) \ge \rho \ge \frac{c_7}{\sqrt{n}}. \]

References

[1] Olivier Catoni. Statistical learning theory and stochastic optimization, volume 1851 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 2004. Lecture notes from the 31st Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001.

[2] Richard M. Dudley. Uniform central limit theorems, volume 63 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 1999.

[3] Evarist Giné and Joel Zinn. Some limit theorems for empirical processes. Ann. Probab., 12(4):929–998, 1984.

[4] Anatoli B. Juditsky, Philippe Rigollet, and Alexandre B. Tsybakov. Learning by mirror averaging. To appear in the Ann. Statist. Available at http://www.imstat.org/aos/future-papers.html, 2006.

[5] Guillaume Lecué. Suboptimality of penalized empirical risk minimization in classification. 20th Annual Conference on Learning Theory, COLT07. Proceedings. Lecture Notes in Artificial Intelligence, 4539:142–156, 2007. Springer, Heidelberg.

[6] Michel Ledoux. The concentration of measure phenomenon, volume 89 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2001.

[7] Michel Ledoux and Michel Talagrand. Probability in Banach spaces, volume 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3). Springer-Verlag, Berlin, 1991.

[8] Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Trans. Inform. Theory, 44(5):1974–1980, 1998.

[9] Shahar Mendelson. Lower bounds for the empirical minimization algorithm. Technical report, 2007.

[10] Arkadi Nemirovski. Topics in non-parametric statistics. In Lectures on probability theory and statistics (Saint-Flour, 1998), volume 1738 of Lecture Notes in Math., pages 85–277. Springer, Berlin, 2000.

[11] Gilles Pisier. The volume of convex bodies and Banach space geometry, volume 94 of Cambridge Tracts in Mathematics. Cambridge University Press, Cambridge, 1989.

[12] R. Tyrrell Rockafellar. Convex analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton, N.J., 1970.

[13] Alexandre B. Tsybakov. Optimal rates of aggregation. 16th Annual Conference on Learning Theory, COLT03. Proceedings. Lecture Notes in Artificial Intelligence, 2777:303–313, 2003. Springer, Heidelberg.

[14] Aad W. van der Vaart and Jon A. Wellner. Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York, 1996.