
CART and Best-Ortho-Basis:

A Connection

David L. Donoho

Statistics

Stanford University

July 1995

Abstract

We study what we call "Dyadic CART", a method of nonparametric regression which constructs a recursive partition by optimizing a complexity-penalized sum of squares, where the optimization is over all recursive partitions arising from midpoint splits. We show that the method is adaptive to unknown degrees of anisotropic smoothness. Specifically, consider the "mixed smoothness" classes consisting of bivariate functions $f(x_1, x_2)$ whose finite difference of distance $h$ in direction $i$ is bounded in $L^p$ norm by $Ch^{\delta_i}$, $i = 1, 2$. We show that Dyadic CART, with an appropriate complexity penalty parameter $\lambda \asymp \sigma^2 \cdot \mathrm{Const} \cdot \log(n)$, is within logarithmic terms of minimax over every mixed smoothness class, $0 < C < \infty$, $0 < \delta_1, \delta_2 \le 1$.

The proof shows that Dyadic CART is identical to a certain adaptive Best-Ortho-Basis algorithm based on the library of all anisotropic Haar bases. It then applies empirical basis selection ideas of Donoho and Johnstone (1994). The basis empirically selected by Dyadic CART is shown to be nearly as good as a basis ideally adapted to the underlying $f$. The risk of estimation in an ideally adapted anisotropic Haar basis is shown to be comparable to the minimax risk over mixed smoothness classes.

Underlying the success of this argument is harmonic analysis of mixed smoothness classes. We show that for each mixed smoothness class there is an anisotropic Haar basis which is a best orthogonal basis for representing that smoothness class; the basis is optimal not just within the library of anisotropic Haar bases, but among all orthogonal bases of $L^2[0,1]^2$.

Key Words and Phrases. Wavelets, Mixed Smoothness, Anisotropic Haar Basis,Best Orthogonal Basis, Minimax Estimation, Spatial Adaptation, Oracle Inequalities.

Acknowledgements. This paper was stimulated by some interesting conversations about CART and wavelets which the author had with Joachim Engel at Oberwolfach, March 1995. It is a pleasure also to acknowledge conversations with R.R. Coifman and I.M. Johnstone.

The author is also at U.C. Berkeley (on leave). This research was partially supported by NSF DMS-92-09130.

Thanks to Helen Tombropoulos for an efficient and enthusiastic typing job.


1 Introduction

The CART methodology of tree-structured adaptive non-parametric regression [4] has been widely used in statistical data analysis since its inception more than a decade ago. Built around ideas of recursive partitioning, it develops, based on an analysis of noisy data, a piecewise constant reconstruction, where the pieces are terminal nodes of a data-driven recursive partition.

The Best-Ortho-Basis methodology of adaptive time-frequency analysis [5] has, more recently, caught the interest of a wide community of applied mathematicians and signal processing engineers. Based on ideas of recursive partitioning of the time-frequency plane, it develops, from an analysis of a given signal, a segmented basis, where the segments are terminal nodes in a data-driven recursive segmentation of the time axis.

Both methods are concerned with recursive dyadic segmentation; therefore trees and tree pruning are key data structures and underlying algorithms in both areas. In addition, there is a mathematical connection between the areas.

Sudeshna Adak, currently a graduate student at Stanford University, has pointed out that central algorithms in the two subjects are really the same: namely the optimal pruning algorithm in Theorem 10.7, page 285, in the CART book [4] and in the Proposition on page 717 in the Best-Basis paper [6]. Both theorems assert that, given a function $E(T)$ which assigns numerical values to a binary tree and its subtrees, and supposing that the function obeys a certain additivity property, the optimal subtree is obtained by breadth-first, bottom-up pruning of the complete tree.

On the other hand, the subjects are also different, since in one case (CART) one is searching for an optimal function on a multidimensional cartesian product domain, and in the other (BOB) one is searching for an optimal orthogonal basis for the vector space of 1-d signals of length $n$.

This paper will exhibit a precise connection between CART and BOB in a specific setting: that in which one seeks an optimal function/basis built from rectangular blocks on a product domain. In this setting we show that certain specific variants of the two apparently different methodologies lead to identical fast algorithms and identical solutions.

1.1 An implication

The connection between CART and Best-Basis affords new insights about recursive partitioning methods. Recently, Donoho and Johnstone have investigated the use of adaptively chosen bases for noise removal [11]. They have developed so-called oracle inequalities which show that certain schemes for basis selection in the presence of noisy data will work well. By adapting such ideas from the Best-Basis setting to the CART setting, we are able to establish new results on the performance of optimal dyadic recursive partitioning. In particular, we are able to show that such methods can be nearly-minimax simultaneously over a wide range of mixed smoothness spaces.

We assume observations of the form

$$y(i_1, i_2) = \bar f(i_1, i_2) + \sigma z(i_1, i_2), \qquad 0 \le i_1, i_2 < n, \tag{1.1}$$

where $n$ is dyadic (an integral power of 2), $z_{i_1,i_2}$ is a white Gaussian noise, and $\sigma > 0$ is a noise level. We assume the observations are related to the underlying $f$ by cell averaging:

$$\bar f(i_1, i_2) = \operatorname{Ave}\{ f \mid [i_1/n, (i_1+1)/n) \times [i_2/n, (i_2+1)/n) \}. \tag{1.2}$$


Our goal is to recover the de-noised cell averages with small mean squared error

$$E\|\hat f - \bar f\|_{\ell^2}^2 = E \sum_{i_1, i_2} (\hat f(i_1, i_2) - \bar f(i_1, i_2))^2.$$

About $f$ we will assume that it belongs to a certain class $\mathcal{F}$, and we will compare performance of estimates with the best mean-squared error available uniformly over the class $\mathcal{F}$, i.e. the minimax risk:

$$M^*(\sigma, n; \mathcal{F}) = \inf_{\hat f(\cdot)} \sup_{f \in \mathcal{F}} E\|\hat f(y) - \bar f\|_{\ell^2}^2. \tag{1.3}$$

For our $\mathcal{F}$ we consider mixed smoothness classes $F^{\delta_1,\delta_2}_p(C)$ consisting of functions on $[0,1]^2$ obeying $\|D^1_h f\|_p \le C h^{\delta_1}$ and $\|D^2_h f\|_p \le C h^{\delta_2}$ for all $h \in (0,1)$, where $D^i_h$ denotes the finite difference of distance $h$ in direction $i$. We let $\mathcal{MS}$ denote the scale of all such classes, where $0 < p < \infty$, $0 < \delta_1, \delta_2 \le 1$ and $0 < C < \infty$.

Our main result:

Theorem 1.1 Dyadic CART (defined in section 2 below), with the specific complexity penalty $\lambda = \lambda(\sigma, \log_e(n))$ defined in section 7 below ($\lambda \asymp \sigma^2 \log_e(n)$), comes within logarithmic factors of minimax over each functional class $F^{\delta_1,\delta_2}_p(C)$, $0 < \delta_1, \delta_2 \le 1$, $C > 0$, $p \in (0, \infty)$. If $\hat f_{\lambda,\sigma}$ denotes the dyadic CART estimator,

$$\sup_{\mathcal{F}} E\|\hat f_{\lambda,\sigma} - \bar f\|_{\ell^2}^2 \le \mathrm{Const}(\delta_1, \delta_2, p) \cdot \log(n) \cdot M^*(\sigma, n; \mathcal{F}) \quad \text{as } n \to \infty, \tag{1.4}$$

for each $\mathcal{F} \in \mathcal{MS}$.

In short, the estimator behaves nearly as well over any class in the scale $\mathcal{MS}$ as one could achieve knowing precisely which smoothness class were true. However, the construction of the optimal recursive partitioning estimator requires no knowledge of which smoothness class might actually be the case. (Indeed, we are unaware of any previous literature suggesting a connection between such smoothness classes and CART.)

We remark that no similar adaptivity is possible using standard isotropic wavelets or isotropic Fourier analysis. This illustrates a theoretical benefit of using recursive partitioning in a setting where objects may possess different degrees of smoothness in different directions.

1.2 Plan of the Paper

In sections 2 through 6 we develop the connection between CART methods and Best Basis methods. Section 2 defines Dyadic CART and describes its fast algorithm. Section 3 defines a library of Anisotropic Haar Bases and describes a fast algorithm for finding a Best Anisotropic Haar basis from given data, where "best" is defined in the Coifman-Wickerhauser sense. In Sections 4 and 5, building on an insight of Joachim Engel (1994), we point out that, with traditional choices of entropy, Best Ortho Basis is different from CART, but that with a special Hereditary Entropy the two methods are the same.

In sections 7 and 8 we discuss ideas first developed in the best-basis setting. Section 7 develops oracle inequalities, which show how to select a basis empirically from noisy data to yield a basis that is nearly as good as the ideal basis which could be designed based on noiseless data. Section 8 describes the best-basis problem for mixed smoothness classes, and shows that a certain kind of anisotropic Haar basis is, in a certain sense, a best basis.

In section 9, building on sections 7 and 8, we show that a certain best-basis de-noising technique introduced by Donoho and Johnstone [11], which is different from CART, is nearly minimax over the scale of mixed smoothness classes. Section 10 establishes our main result for CART by comparing the CART estimator with this best-basis de-noising method, and showing that the two estimates have comparable performance over mixed smoothness spaces. Section 11 discusses comparisons and generalizations.

2 Dyadic CART

We change notation slightly from (1.1). We suppose we observe noisy 2-dimensional data on a regular square $n \times n$ array of "pixels",

$$y(i_1, i_2) = f(i_1, i_2) + \sigma \cdot z(i_1, i_2), \qquad 0 \le i_1, i_2 < n, \tag{2.1}$$

where (in a change from the last section) $f$ is the object of interest, an $n$-by-$n$ array, and $z$ is a standard Gaussian white noise (i.i.d. $N(0,1)$). We also introduce a fruitful abuse of notation. We write $[0, n)$ for the discrete interval $\{0, \ldots, n-1\}$. Thus $[0, n)^2$ is a discrete square, etc. Here and below we also write $i = (i_1, i_2)$, so $y(i) = f(i) + \sigma z(i)$, for $i \in [0, n)^2$, is an equivalent form of (2.1). Finally, we use the variable $N = n^2$ to stand for the cardinality of the $n$-by-$n$ array $y$.

In this setting, the CART methodology constructs a piecewise constant estimator $\hat f$ of $f$; data-adaptively, it builds a partition $\mathcal{P}$ of $[0, n)^2$ and finds $\hat f$ by the rule

$$\hat f(i \mid \mathcal{P}) = \operatorname{Ave}\{y \mid R(i, \mathcal{P})\}, \tag{2.2}$$

where $R(i, \mathcal{P})$ denotes the rectangle of the partition $\mathcal{P}$ containing $i$.

2.1 Optimal Dyadic CART

There are several variants of CART, depending on the procedure used to construct the partition $\mathcal{P}$. In this paper, we are only interested in optimal (non-greedy) dyadic recursive partitioning. With an acknowledged risk of misunderstanding, we call this dyadic CART. We define terms.

Dyadic Partitioning. Starting from the trivial partition $\mathcal{P}_0 = \{[0, n)^2\}$ we may generate new partitions by splitting $[0, n)^2$ into two pieces either vertically or horizontally, yielding either the partition $\{[0, n/2) \times [0, n),\ [n/2, n) \times [0, n)\}$ or $\{[0, n) \times [0, n/2),\ [0, n) \times [n/2, n)\}$. We can apply this splitting recursively, generating other partitions. Thus, let $\mathcal{P} = \{R_1, \ldots, R_k\}$ be a partition and let $R$ stand for one of the rectangles in the partition. We can create a new partition by splitting $R$ in half horizontally or vertically. If $R = [a, b) \times [c, d)$ then let $R^{1,0}$ and $R^{1,1}$ denote the results of horizontal splitting, i.e.

$$R^{1,0} = [a, (a+b)/2) \times [c, d),$$

$$R^{1,1} = [(a+b)/2, b) \times [c, d),$$

while we let $R^{2,0}$ and $R^{2,1}$ denote the results of vertical splitting,

$$R^{2,0} = [a, b) \times [c, (c+d)/2),$$

$$R^{2,1} = [a, b) \times [(c+d)/2, d).$$

Note that if $b = a + 1$ or $d = c + 1$ then horizontal/vertical splitting (respectively) is not possible: only nonempty rectangles are allowed.


As an example, if we split vertically the rectangle $R = R_\ell$, say, we produce the $(k+1)$-element partition $\{R_1, \ldots, R_{\ell-1}, R_\ell^{2,0}, R_\ell^{2,1}, R_{\ell+1}, \ldots, R_k\}$.

A recursive dyadic partition is any partition reachable by successive application of these rules.

Optimal Partitions. CART is often used to refer to "greedy growing" followed by "optimal pruning", where the partition $\mathcal{P}$ is constructed in a heuristic, myopic fashion. For the purposes of this paper, we consider instead the use of optimizing partitions, where the dyadic partition $\mathcal{P}$ is constructed as the optimum of the complexity-penalized residual sum of squares. Thus, with

$$\mathrm{CPRSS}(\mathcal{P}, \lambda) = \|y - \hat f(\cdot \mid \mathcal{P})\|_{\ell^2_N}^2 + \lambda \#(\mathcal{P}), \tag{2.3}$$

what we will call (again in perhaps a slight abuse of nomenclature) dyadic CART seeks the partition

$$\hat{\mathcal{P}}_\lambda = \operatorname{argmin}_{\mathcal{P}} \mathrm{CPRSS}(\mathcal{P}, \lambda). \tag{2.4}$$

The idea of using globally optimal partitions is covered in passing in [4, Chapter 10]. For the moment we let $\lambda$ be a free parameter; in section 7 below we will propose a specific choice.

Dyadic CART differs from what is usually called CART, in that dyadic CART can split rectangles only in half, while general CART can split rectangles in all proportions. While the extra flexibility of general CART may be useful, this flexibility is sufficient to make the finding of an exactly optimal partition unwieldy. Dyadic CART allows a more limited range of possible partitions, which makes it possible to find an optimal partition in order $O(N)$ time.

2.2 Fast Optimal Partitioning

To describe the algorithm, we introduce some notation.

Rectangles. We use $I$ generically to denote dyadic intervals, i.e. intervals $I = [a, b)$ with $a = n \cdot k/2^j$ and $b = n \cdot (k+1)/2^j$ with $2^j \le n$ and $0 \le k < 2^j$. We use $R$ to denote dyadic rectangles, i.e., rectangles $I_1 \times I_2$.

Parents and Siblings. Two dyadic rectangles are siblings if their union is a dyadic rectangle. This is equivalent to saying that we can write either

$$R_i = I_i \times I_0, \qquad i = 1, 2, \tag{2.5}$$

or

$$R_i = I_0 \times I_i, \qquad i = 1, 2, \tag{2.6}$$

where $I_0, I_1, I_2$ are dyadic intervals and

$$I_1 = [n \cdot 2k/2^j,\ n \cdot (2k+1)/2^j), \qquad I_2 = [n \cdot (2k+1)/2^j,\ n \cdot (2k+2)/2^j), \tag{2.7}$$

with $0 \le k < 2^{j-1}$, $1 \le j \le \log_2(n)$. A pair satisfying (2.5) is a pair of horiz-sibs; a pair satisfying (2.6) is a pair of vert-sibs.

The union of two siblings is the parent rectangle. Each rectangle generally has two siblings, a vert-sib and a horiz-sib, and two parents, a vert-parent and a horiz-parent. Parents generally have two sets of children: a pair of horiz-kids and a pair of vert-kids. In extreme cases a rectangle may have only a vert-sib (if it is very wide, such as $[0, n) \times [0, n/2)$), or only a horiz-sib (if it is very tall, such as $[0, n/2) \times [0, n)$). In some cases a rectangle may have only vert-kids (if it is very narrow, such as $[0, 1) \times [0, n/2)$) or only horiz-kids (if it is very short, such as $[0, n/2) \times [0, 1)$).

Inheritance. CPRSS has an "inheritance property" which we more easily see by taking a general point of view. Let $CART(R)$ denote the problem of finding the optimal partition for just the data falling in the dyadic rectangle $R$:

$$(CART(R)) \qquad \mathcal{P}(R) = \operatorname{argmin}\ \|y - \hat f(\cdot \mid \mathcal{P}(R))\|_{\ell^2(R)}^2 + \lambda \cdot \#(\mathcal{P}(R)).$$

Here $\mathcal{P}(R)$ denotes a recursive dyadic partition of $R$, and $\|\cdot\|_{\ell^2(R)}^2$ refers to the sum-of-squares only of data falling in the rectangle $R$.

Here is the inheritance property of optimal partitions. Let $R$ be a dyadic rectangle and suppose it has both vert-children and horiz-children. Then the optimal partition of $R$ is either: (i) the trivial partition $\{R\}$, (ii) the union of optimal partitions of the horiz-kids $\mathcal{P}(R^{1,0}) \cup \mathcal{P}(R^{1,1})$, or (iii) the union of optimal partitions of the vert-kids $\mathcal{P}(R^{2,0}) \cup \mathcal{P}(R^{2,1})$. Which of these three cases holds can be determined by holding a "tournament", selecting the winner as the smallest of the three numbers

$$\|y - \operatorname{Ave}\{y \mid R\}\|_{\ell^2(R)}^2 + \lambda, \qquad CART(R^{1,0}) + CART(R^{1,1}), \qquad CART(R^{2,0}) + CART(R^{2,1}),$$

where the first number is the CPRSS of the trivial partition: its residual sum of squares plus $\lambda$ for its single cell.

The exception to this rule is of course at the finest scale: a 1-by-1 rectangle has no children, and so the optimal partition of such an $R$ is just the trivial partition $\{R\}$.

By starting from the next-to-finest scale and applying the inheritance property, we can get the optimal partitions of all 2-by-1 rectangles, and of all 1-by-2 rectangles. By going to the next coarser level and applying inheritance, we can get the optimal partitions of all 4-by-1, of all 2-by-2 and of all 1-by-4 rectangles. And so on. Continuing in a fine-to-coarse or 'bottom-up' fashion, we eventually get to the coarsest level and obtain an optimal partition for $[0, n)^2$.

There are $\sim 2n$ dyadic intervals and hence $\sim 4n^2 = 4N$ dyadic rectangles. Each dyadic rectangle is visited once in the main loop of the algorithm and there are at most a certain constant number $C$ of additions and multiplications per visit. Total work: $\sim C \cdot 4 \cdot N$ ops and $\sim 16 \cdot N$ storage locations. See the appendix for a formal description of the algorithm.
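To make the recursion concrete, here is a minimal sketch in Python (the names `dyadic_cart`, `cart`, etc. are illustrative, not the appendix's notation). It plays the three-way tournament at each dyadic rectangle; 2-d cumulative sums make each rectangle's residual sum of squares an O(1) computation, and memoization ensures each of the $\sim 4N$ rectangles is solved once. The recursion is written top-down, but memoization makes it visit the same subproblems as the bottom-up pass.

```python
import numpy as np
from functools import lru_cache

def dyadic_cart(y, lam):
    """Optimal dyadic-CART partition of an n-by-n array (n a power of 2).

    Returns (optimal CPRSS value, list of rectangles (a, b, c, d), each
    meaning [a, b) x [c, d)) via the 'tournament' over midpoint splits.
    """
    n = y.shape[0]
    # Cumulative sums: block sums and sums of squares in O(1) per query.
    S  = np.zeros((n + 1, n + 1)); S[1:, 1:]  = np.cumsum(np.cumsum(y,     0), 1)
    S2 = np.zeros((n + 1, n + 1)); S2[1:, 1:] = np.cumsum(np.cumsum(y * y, 0), 1)

    def block(M, a, b, c, d):
        return M[b, d] - M[a, d] - M[b, c] + M[a, c]

    @lru_cache(maxsize=None)
    def cart(a, b, c, d):
        # Candidate (i): trivial partition {R}; RSS plus lambda for one cell.
        s, s2, area = block(S, a, b, c, d), block(S2, a, b, c, d), (b - a) * (d - c)
        best = (s2 - s * s / area + lam, [(a, b, c, d)])
        if b - a > 1:                                # (ii) horizontal midpoint split
            m = (a + b) // 2
            v0, p0 = cart(a, m, c, d); v1, p1 = cart(m, b, c, d)
            if v0 + v1 < best[0]: best = (v0 + v1, p0 + p1)
        if d - c > 1:                                # (iii) vertical midpoint split
            m = (c + d) // 2
            v0, p0 = cart(a, b, c, m); v1, p1 = cart(a, b, m, d)
            if v0 + v1 < best[0]: best = (v0 + v1, p0 + p1)
        return best

    return cart(0, n, 0, n)
```

For instance, on an array that is constant on its top and bottom halves, the optimum with small $\lambda$ is the two-rectangle partition, at cost $2\lambda$.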

3 Best-Ortho-Basis

We now turn attention away from CART. We recall the standard notation for Haar functions in dimension 1. Let $I$ be a dyadic subinterval of $[0, n)$ and let $\phi_I(i) = |I|^{-1/2} 1_I(i)$. If $I$ contains at least 2 points, set $h_I(i) = (1_{I^{(1)}}(i) - 1_{I^{(0)}}(i))\,|I|^{-1/2}$, where $I^{(1)}$ is the right half of $I$ and $I^{(0)}$ is the left half of $I$.

Using these, we can build anisotropic Haar functions in 2 dimensions. Let $R$ be a dyadic rectangle $I_1 \times I_2$; we can form 3 atoms

$$\phi^0_R(i_1, i_2) = \phi_{I_1}(i_1)\,\phi_{I_2}(i_2),$$

$$\phi^1_R(i_1, i_2) = h_{I_1}(i_1)\,\phi_{I_2}(i_2),$$

$$\phi^2_R(i_1, i_2) = \phi_{I_1}(i_1)\,h_{I_2}(i_2).$$

These are naturally associated with the rectangle $R$: $\phi^0_R$ is, up to scaling, the indicator of $R$, while $\phi^1_R$ and $\phi^2_R$ are associated with horizontal and vertical midpoint splits of $R$.


Adapting terminology proposed by Mallat and Zhang in a different setting, we call the $\phi^s_R$ atoms, and the collection of all such atoms $\phi^s_R$, indexed by $(R, s)$, makes up a dictionary of atoms. This dictionary is overcomplete; it contains $\sim 3n^2 = 3N$ atoms, while the span of these elements is of dimension only $N$.

3.1 Anisotropic Haar Bases

Certain structured subcollections of the elements of $\mathcal{D}$ make up orthogonal bases. These subcollections are in correspondence with complete recursive partitions, that is to say, recursive dyadic partitions in which all terminal nodes are 1-by-1 rectangles $[i_1, i_1+1) \times [i_2, i_2+1)$ containing a single point $i = (i_1, i_2)$.

Given a complete recursive partition $\mathcal{P}^*$, the corresponding ortho basis $B$ is constructed as follows. Let $NT(\mathcal{P}^*)$ be the collection of all rectangles encountered at non-terminal stages of the recursive partitioning leading to $\mathcal{P}^*$. Let $R \in NT(\mathcal{P}^*)$. As $R$ is nonterminal, it will be further subdivided in forming $\mathcal{P}^*$, i.e. it will be split either horizontally or vertically; let $s(R) = 1$ or $2$ according to the splitting variable chosen. Then define $B$ as the collection of all such $\phi^{s(R)}_R$ together with $\phi_{[0,n)^2}$:

$$B(\mathcal{P}^*) = \{\phi_{[0,n)^2}\} \cup \{\phi^{s(R)}_R\}_{R \in NT(\mathcal{P}^*)}. \tag{3.1}$$

Theorem 3.1 Let $\mathcal{P}^*$ be a complete recursive dyadic partition of $[0, n)^2$ and let $B(\mathcal{P}^*)$ be constructed as in (3.1). This is an orthobasis for the $N$-dimensional vector space of $n \times n$ arrays.

Proof. Indeed, $B$ has cardinality $N$, and the elements of $B$ are normalized and pairwise orthogonal. The pairwise orthogonality comes from two simple facts. Take any two distinct elements in $B$; then either they have disjoint support, or the support of one is included in the support of the other. In the first instance, orthogonality is immediate; in the second instance orthogonality follows from two observations: (i) one element of the pair, call it $\psi$, is supported in a rectangle on which the other element, $\varphi$ say, is constant; and (ii) the element $\psi$ has zero mean, and so is orthogonal to any function which is constant on its support, i.e. to $\varphi$. $\Box$
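The theorem is easy to confirm numerically. The sketch below (helper names are illustrative) generates a random complete recursive dyadic partition of $[0, n)^2$, assembles the corresponding atoms as in (3.1), and checks that the $N$ flattened atoms form an orthonormal set.

```python
import numpy as np
import random

def phi(I):
    """Normalized indicator of a 1-d dyadic interval; I = (a, b, n)."""
    a, b, n = I
    v = np.zeros(n); v[a:b] = (b - a) ** -0.5
    return v

def haar(I):
    """1-d Haar atom h_I (I must contain at least 2 points)."""
    a, b, n = I
    v = np.zeros(n); m = (a + b) // 2
    v[a:m], v[m:b] = -((b - a) ** -0.5), (b - a) ** -0.5
    return v

def random_basis(n, rng):
    """Atoms of B(P*) for a random complete recursive dyadic partition."""
    atoms = [np.outer(phi((0, n, n)), phi((0, n, n)))]   # phi_{[0,n)^2}

    def split(a, b, c, d):
        if b - a == 1 and d - c == 1:                    # terminal 1-by-1 cell
            return
        # Forced split direction for degenerate rectangles, else random.
        s = 1 if d - c == 1 else 2 if b - a == 1 else rng.choice((1, 2))
        if s == 1:                                       # horizontal split
            atoms.append(np.outer(haar((a, b, n)), phi((c, d, n))))
            m = (a + b) // 2; split(a, m, c, d); split(m, b, c, d)
        else:                                            # vertical split
            atoms.append(np.outer(phi((a, b, n)), haar((c, d, n))))
            m = (c + d) // 2; split(a, b, c, m); split(a, b, m, d)

    split(0, n, 0, n)
    return atoms
```

Flattening the atoms of `random_basis(8, ...)` into the rows of a 64-by-64 matrix $B$ gives $BB^T = I$, as the theorem asserts.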

Each such basis $B$ has a fast transform, produced in a fashion similar to the Haar transform in dimension 1. Indeed the coefficients in such a basis can be computed in terms of block averages and differences of block averages. If $S(R) = \sum_{i \in R} y(i)$ denotes the sum of values in a rectangle $R$, then of course

$$\langle y, \phi_R \rangle = S(R) \cdot |R|^{-1/2}, \tag{3.2}$$

while if $(R^{1,0}, R^{1,1})$ are horizontal kids of $R$,

$$\langle y, \phi^1_R \rangle = (S(R^{1,1}) - S(R^{1,0})) \cdot |R|^{-1/2}, \tag{3.3}$$

and if $(R^{2,0}, R^{2,1})$ are vertical kids of $R$,

$$\langle y, \phi^2_R \rangle = (S(R^{2,1}) - S(R^{2,0})) \cdot |R|^{-1/2}. \tag{3.4}$$

These relations are useful because there is a simple "Pyramid-of-Adders" for calculating all $(S(R) : R \in NT(\mathcal{P}^*))$ in order $N$ time. See the appendix for a formal description of the algorithm.
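A minimal sketch of such a pyramid of adders, under illustrative names: the sums over the finest rectangles are the data themselves, and every coarser sum is formed by a single addition of two sibling sums, first collapsing one coordinate and then the other, so the total work is proportional to the number of dyadic rectangles, $\sim 4N$.

```python
import numpy as np

def dyadic_intervals(n):
    """All dyadic subintervals (a, b) of [0, n), coarse to fine (n a power of 2)."""
    out, j = [], 0
    while (1 << j) <= n:
        size = n >> j
        out += [(k * size, (k + 1) * size) for k in range(1 << j)]
        j += 1
    return out

def block_sums(y):
    """S(R) for every dyadic rectangle R = I1 x I2, via a pyramid of adders."""
    n = y.shape[0]
    ivals = dyadic_intervals(n)
    # Pass 1: collapse the second coordinate; each coarser interval is one
    # vector addition of its two sibling (half-length) intervals.
    rows = {(a, b): y[:, a:b].sum(axis=1) if b - a == 1 else None
            for (a, b) in ivals}
    for (a, b) in reversed(ivals):                 # fine to coarse
        if b - a > 1:
            m = (a + b) // 2
            rows[(a, b)] = rows[(a, m)] + rows[(m, b)]
    # Pass 2: collapse the first coordinate the same way, per column-interval.
    S = {}
    for I2, col in rows.items():
        acc = {(a, b): col[a] if b - a == 1 else None for (a, b) in ivals}
        for (a, b) in reversed(ivals):
            if b - a > 1:
                m = (a + b) // 2
                acc[(a, b)] = acc[(a, m)] + acc[(m, b)]
        for I1, v in acc.items():
            S[(I1, I2)] = v
    return S
```

From these sums, the coefficients (3.2)-(3.4) follow by single subtractions and scalings.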


3.2 Best Basis Algorithm

The collection of all anisotropic Haar bases and fast transforms makes for a potentially very useful library. It contains bases associated with partitions which subdivide much more finely in $i_1$ than in $i_2$ on some subsets of $[0, n)^2$ and more finely in $i_2$ than in $i_1$ on other subsets. There is therefore the possibility of finding bases very well adapted to certain anisotropic problems.

How to choose a "best-adapted" basis? In the general framework set up in the context of cosine packet/wavelet packet bases by Coifman and Wickerhauser (1992), one specifies an additive "entropy" functional of the vector $\theta \in \mathbb{R}^N$,

$$E(\theta) = \sum_{i=1}^{N} e(\theta_i), \tag{3.5}$$

where $e(t)$ is a scalar function. Coifman and Wickerhauser's original proposal was $e_{CW}(t) = -t^2 \log(t^2)$, but $e_p(t) = |t|^p$, where $0 < p < 2$, also makes sense, as well as other choices; see below. One uses such a functional to evaluate the quality of a basis; if $\theta(f, B)$ denotes the vector of coefficients of the object $f$ in basis $B$, then $E(\theta(f, B))$ is a measure of the usefulness of a basis for representing $f$, and the best basis $\hat B$ in a library $\mathcal{L}$ of ortho bases solves the problem

$$\min_{B \in \mathcal{L}} E(\theta(f, B)). \tag{3.6}$$

In the specific case of interest, there are as many bases in the library as there are complete recursive partitions. Elementary arguments show that the number of bases is exponential in $N$.

While such exponential behavior makes brute force calculation of the optimum in (3.6) practically impossible, judicious application of dynamic programming gives a practical algorithm.

In order to express the key analytic feature of the objective functional, we take a more general point of view, and consider the problem of finding an optimal basis for just the data falling in the dyadic rectangle $R$. Each complete recursive dyadic partition of $R$, $\mathcal{P}^*(R)$ say, leads to an anisotropic Haar basis, $B(R)$ say, for the collection of $n$-by-$n$ arrays supported only in $R$. Hence we can define the optimization problem

$$(BOB(R)) \qquad \hat B(R) = \operatorname{argmin}_{B(R)} \tilde E(\theta(y; B(R))).$$

Here $\theta(y; B(R))$ refers to the coefficients in an anisotropic basis for $\ell^2(R)$, and $\tilde E(\theta) = \sum_{i=2}^{\dim(\theta)} e(\theta_i)$ refers to a relative entropy, which ignores the first coordinate. We let $\mathcal{P}^*(R)$ denote the corresponding optimal complete recursive dyadic partition of $R$.

Solutions to $BOB(R)$ have a key inheritance property. Let $R$ be a dyadic rectangle

and suppose it has both vert-children and horiz-children. Then the optimal basis of $R$ is generated by a complete recursive dyadic partition $\mathcal{P}^*(R)$ formed in one of two ways. This partition is either: (i) the union of optimal partitions of the horiz-children $\mathcal{P}^*(R^{1,0}) \cup \mathcal{P}^*(R^{1,1})$, or (ii) the union of optimal partitions of the vert-children $\mathcal{P}^*(R^{2,0}) \cup \mathcal{P}^*(R^{2,1})$. Which of these two cases holds can be determined by holding a "tournament", selecting the winner as the smaller of the numbers

$$BOB(R^{1,0}) + BOB(R^{1,1}) + e_1, \qquad BOB(R^{2,0}) + BOB(R^{2,1}) + e_2,$$


where $e_i = e(\langle y, \phi^i_R \rangle)$.

The exception to this rule is of course at the finest scale: a 2-by-1 or 1-by-2 rectangle has only one complete recursive partition, and no tournament is necessary to select a "best" one.

By starting from the next-to-finest scale and applying the inheritance property, we can get the optimal partitions of all 4-by-1, of all 2-by-2 and of all 1-by-4 rectangles (omitting again the tournament for 4-by-1 and 1-by-4 rectangles). And so on. Continuing in a fine-to-coarse or 'bottom-up' fashion, we eventually get to the coarsest level and obtain an optimal partition for $[0, n)^2$.

Once again there are $\sim 4n^2 = 4N$ dyadic rectangles. Each dyadic rectangle is visited once in the main part of the algorithm and there are at most a certain constant number $C$ of additions and multiplications per visit. Total work: $\sim C \cdot 4 \cdot N$ ops and $\sim 4N$ storage locations. See the appendix for a formal description of the algorithm.
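The tournament recursion for an additive entropy can be sketched as follows (names are illustrative; the appendix gives the formal algorithm). The detail coefficient of each candidate split is computed in O(1) from cumulative sums via (3.3)-(3.4), and memoization visits each dyadic rectangle once.

```python
import numpy as np
from functools import lru_cache

def best_anisotropic_basis(y, e):
    """Best-basis search over the library of anisotropic Haar bases.

    `e` is the scalar entropy term, e.g. lambda t: abs(t) for the l^1
    entropy.  Returns (minimal relative entropy, split choices), where the
    split choices map each nonterminal rectangle (a, b, c, d) to 1
    (horizontal) or 2 (vertical), the s(R) of the text.
    """
    n = y.shape[0]
    C = np.zeros((n + 1, n + 1)); C[1:, 1:] = np.cumsum(np.cumsum(y, 0), 1)
    block = lambda a, b, c, d: C[b, d] - C[a, d] - C[b, c] + C[a, c]
    splits = {}

    @lru_cache(maxsize=None)
    def bob(a, b, c, d):
        area = (b - a) * (d - c)
        if area == 1:                       # terminal cell: no detail coefficient
            return 0.0
        best = (np.inf, None)
        if b - a > 1:                       # detail coefficient of phi^1_R, eq (3.3)
            m = (a + b) // 2
            theta = (block(m, b, c, d) - block(a, m, c, d)) * area ** -0.5
            v = e(theta) + bob(a, m, c, d) + bob(m, b, c, d)
            if v < best[0]: best = (v, 1)
        if d - c > 1:                       # detail coefficient of phi^2_R, eq (3.4)
            m = (c + d) // 2
            theta = (block(a, b, m, d) - block(a, b, c, m)) * area ** -0.5
            v = e(theta) + bob(a, b, c, m) + bob(a, b, m, d)
            if v < best[0]: best = (v, 2)
        splits[(a, b, c, d)] = best[1]
        return best[0]

    return bob(0, n, 0, n), splits
```

On a constant array every detail coefficient vanishes, so every complete partition ties at entropy zero.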

4 Best Basis De-Noising

CART has to do with removing noise from the data $y$ to produce a reconstruction $\hat f$ approximating the noiseless data $f$. The philosophy of BOB is much less specific: it may be used for many purposes, for example in data compression and in fast numerical analysis [5]. The application determines the choice of entropy, and the use of the expansion in the best basis.

To use best-basis ideas for noise removal, one could apply the proposals of Donoho and Johnstone [11]. Define

$$E_\lambda(\theta) = \sum_{i=1}^{N} \min(\theta_i^2, \lambda^2\sigma^2) \tag{4.1}$$

and obtain an optimal basis

$$\hat B = \operatorname{argmin}_{B \in \mathcal{L}} E_\lambda(\theta(y, B)). \tag{4.2}$$

Then apply hard thresholding in the selected basis, at threshold level $\lambda\sigma$:

$$\hat\theta_i = \theta(y, \hat B)_i \cdot 1_{\{|\theta(y, \hat B)_i| > \lambda\sigma\}}, \qquad 1 \le i \le N. \tag{4.3}$$

Reconstruct the object $\hat f$ having coefficients $\hat\theta$ in basis $\hat B$. This is the de-noised object. [11] developed results, to be discussed in section 7, showing that with an appropriate choice of $\lambda$, the empirical basis chosen by this scheme was near-ideal.

In the current setting, where $\mathcal{L}$ is the library of anisotropic Haar bases, (4.2) is amenable to treatment by the fast best-basis algorithm of the last section. So it may be computed in order $N$ time. This de-noising estimate, while possessing certain nice characteristics, lacks one of the attractive features of CART: an interpretation as a spatially-adaptive averaging method. Such a spatially-adaptive method would have the form

$$\hat f(i) = \sum_{R \in \mathcal{P}} \langle y, \phi_R \rangle\, \phi_R(i),$$

giving a piecewise-constant reconstruction based on rectangular averages of the noisy data $y$ over rectangles $R$. Here the partition $\mathcal{P} = \mathcal{P}(y)$ would be chosen data-adaptively, and once the partition were chosen, the reconstruction would take a simple form of averaging. While we will mention this procedure further below, and use its properties, we mention it now only to show that threshold de-noising in a Best-Ortho-Basis is not identical to CART.
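The three steps (4.1)-(4.3) can be sketched directly. Here the minimization over bases is written as brute force over a small dictionary of precomputed coefficient vectors (an illustrative setup); over the full anisotropic Haar library one would instead use the fast algorithm of Section 3.

```python
import numpy as np

def E_lambda(theta, lam, sigma):
    """The de-noising entropy (4.1): sum_i min(theta_i^2, lam^2 sigma^2)."""
    return np.minimum(theta ** 2, (lam * sigma) ** 2).sum()

def denoise_in_basis(theta, lam, sigma):
    """Hard thresholding (4.3) at level lam * sigma in the chosen basis."""
    return theta * (np.abs(theta) > lam * sigma)

def best_basis_denoise(coeff_by_basis, lam, sigma):
    """Pick the basis minimizing E_lambda as in (4.2), then threshold.

    `coeff_by_basis` maps a basis label to its coefficient vector
    theta(y, B); the thresholded coefficients in the winning basis are
    the coefficients of the de-noised object.
    """
    B = min(coeff_by_basis,
            key=lambda b: E_lambda(coeff_by_basis[b], lam, sigma))
    return B, denoise_in_basis(coeff_by_basis[B], lam, sigma)
```

Note the affinity of (4.1) with thresholding: a basis concentrating the energy of $y$ in a few large coefficients pays only $\lambda^2\sigma^2$ for each of them and nearly nothing for the rest, so sparse representations win the minimization.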


5 Tree Constraints in the 1-d Haar System

In the context of the ordinary 1-d Haar transform, Joachim Engel (Engel, 1994) has shown that a special type of reconstruction in the Haar system can be related to recursive partitioning. Let, temporarily, $y = (y_i)_{i=0}^{n-1}$ and suppose $y_i = g(i) + v_i$, with $v_i$ noise. Consider reconstructions $\hat g$ of the form

$$\hat g(i) = \bar y + \sum_I w_I \langle y, h_I \rangle\, h_I(i), \tag{5.1}$$

where the sum is over dyadic subintervals of $[0, n)$ and the $w_I$ are scalar "weights". Now impose on the weights $(w_I)$ two constraints:

[Tree-i]. Keep-or-kill. Each weight is 1 or 0.

[Tree-ii]. Heredity. $w_I$ can be 1 only if also $w_{I'} = 1$ whenever $I \subset I'$. If $w_I = 0$ then $w_{I'} = 0$ for every $I' \subset I$.

Each set of weights satisfying these constraints selects the nodes of a dyadic tree $T$. Engel has called such constraints tree-constraints, and shown that reconstructions obeying these constraints may be put in the form of spatial averages.

Theorem 5.1 (Engel, 1994) Suppose that $\hat g$ defined by (5.1) obeys the tree-constraints (Tree-i) and (Tree-ii). Say that $I$ is terminal if $w_I = 0$ but $w_{I'} = 1$ for every dyadic interval $I' \supsetneq I$. The collection of terminal intervals forms a partition $\mathcal{P}$, and

$$\hat g(i) = \sum_{I \in \mathcal{P}} \langle y, \phi_I \rangle\, \phi_I(i). \tag{5.2}$$
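Engel's identity is easy to check numerically. In the sketch below (helper names are illustrative), a hereditary set of kept Haar details is expanded via (5.1), and the result is compared against plain averaging over the terminal intervals, i.e. the maximal dyadic intervals carrying no kept weight.

```python
import numpy as np

def haar_recon(y, kept):
    """Reconstruction (5.1): grand mean plus the kept Haar details.

    `kept` is a set of dyadic intervals (a, b) with w_I = 1; it must be
    closed under taking dyadic ancestors (the heredity constraint).
    """
    n = len(y)
    g = np.full(n, y.mean())
    for (a, b) in kept:
        m = (a + b) // 2
        h = np.zeros(n)
        h[a:m], h[m:b] = -((b - a) ** -0.5), (b - a) ** -0.5
        g += (y @ h) * h
    return g

def terminal_partition(n, kept):
    """Maximal dyadic intervals all of whose proper ancestors are kept."""
    out = []
    def walk(a, b):
        if (a, b) in kept:
            m = (a + b) // 2
            walk(a, m); walk(m, b)
        else:
            out.append((a, b))
    walk(0, n)
    return out
```

With `kept = {(0, 8), (0, 4)}` the terminal intervals are $[0,2)$, $[2,4)$, $[4,8)$, and the Haar reconstruction agrees with the piecewise averages over them.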

6 Hereditary Constraints and CART

Tree-constraints make sense also in the setting of 2-d anisotropic Haar bases. We consider reconstructions

$$\hat f(i) = \bar y + \sum_{R \in NT(\mathcal{P}^*)} w_R \langle y, \phi^{s(R)}_R \rangle\, \phi^{s(R)}_R(i), \tag{6.1}$$

where $\mathcal{P}^*$ is a complete recursive dyadic partition, $\{\phi_{[0,n)^2}\} \cup \{\phi^{s(R)}_R\}$ the associated orthogonal basis, and the weights $(w_R)$ obey two hereditary constraints:

[Hered-i]. Keep-or-kill. Each weight $w_R$ is 0 or 1.

[Hered-ii]. Heredity. $w_R = 1$ implies $w_{R'} = 1$ for all ancestors $R'$ of $R$ in $\mathcal{P}^*$; $w_R = 0$ implies $w_{R'} = 0$ for all descendants $R'$ of $R$ in $\mathcal{P}^*$.

We state without proof the analog of Engel's theorem.

Theorem 6.1 The reconstruction $\hat f$ obeying (6.1), [Hered-i], and [Hered-ii] has precisely the form

$$\hat f(i) = \sum_{R \in \mathcal{P}} \langle y, \phi_R \rangle\, \phi_R(i)$$

for some possibly incomplete recursive dyadic partition $\mathcal{P}$.

Three questions arise naturally about reconstructions obeying hereditary constraints.

Q1. How to find the best hereditary reconstruction in a given basis?

Q2. How to find the basis in which hereditary reconstruction works best?

Q3. How to efficiently calculate the hereditary best-basis?

All three questions have attractive answers.

6.1 Best Hereditary Reconstruction in Given Basis.

Let $\mathcal{T}^*$ denote the complete binary tree of depth $\log_2(N) - 1$. Identifying subtrees $\mathcal{T} \subset \mathcal{T}^*$ with the weights $(w_R)$ obeying [Hered-i]-[Hered-ii], we write $\hat f_{B,\mathcal{T}}$ for the reconstruction (6.1) in basis $B$ having weights $(w_R)$ associated with the tree $\mathcal{T}$.

We define the "best" hereditary reconstruction in terms of the hereditary CPRSS

$$\mathrm{CPRSS}(\mathcal{T}; \lambda, B) = \|y - \hat f_{B,\mathcal{T}}\|_{\ell^2}^2 + \lambda \#(\mathcal{T}), \tag{6.2}$$

and the "best" reconstruction is the one achieving the minimal CPRSS among all such reconstructions:

$$\min_{\mathcal{T} \subset \mathcal{T}^*} \mathrm{CPRSS}(\mathcal{T}; \lambda, B). \tag{6.3}$$

By orthogonality of the basis $B$, we can reformulate this in terms of coordinates. Let $\theta = \theta(y, B)$ denote the vector of coordinates and $(w_R \theta_R(y, B))$ denote the same vector after applying the weights $w_R$ associated with the subtree $\mathcal{T}$. Then we have the following equivalent form of (6.2):

$$\mathrm{CPRSS}(\mathcal{T}) = \sum_R \left((w_R - 1)^2 \theta_R^2 + \lambda w_R\right).$$

This quantity has an inheritance property, which we express as follows. Let $\mathcal{T}^*(R)$ denote the complete tree of depth $\log_2(\#R) - 1$ and define the optimization problem

$$(Hered(R)) \qquad \min_{\mathcal{T} \subset \mathcal{T}^*(R)} \sum_{R'} \left((w_{R'} - 1)^2 \theta_{R'}^2 + \lambda w_{R'}\right).$$

The optimization problem implicitly defines an optimal subtree $\mathcal{T}(R)$. The inheritance property: the optimal subtree $\mathcal{T}(R)$ is a function of the optimal subtrees of the children problems $\mathcal{T}(R^{s(R),b})$, $b = 0, 1$. The tree $\mathcal{T}(R)$ is either the empty subtree, or else it has the $\mathcal{T}(R^{s(R),b})$ as subtrees joined at $\mathrm{root}(\mathcal{T}(R))$.

It follows by this inheritance property that the optimal subtree may be computed by a bottom-up pruning exactly as in the optimal-pruning algorithm of CART, Algorithm 10.1, page 294 of the CART book. Hence, a minimizing subtree may be found in order $N$ time. A formal statement of the algorithm is given in the appendix.
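The pruning pass can be sketched as follows, with the coefficients of a single basis stored in heap order (an illustrative layout, not the paper's). At each node, killing the whole subtree costs the subtree's sum of squared coefficients, while keeping the node costs $\lambda$ plus the children's optimal values; one bottom-up pass therefore yields the minimizing subtree in order $N$ work.

```python
def prune(theta, lam, node=0):
    """Optimal hereditary keep-or-kill weights by bottom-up pruning.

    `theta` holds detail coefficients of a complete binary tree in heap
    order (children of node i at 2i+1 and 2i+2).  Returns a triple
    (optimal value at this node, sum of theta^2 over the subtree, set of
    kept nodes), minimizing sum_R ((w_R - 1)^2 theta_R^2 + lam * w_R)
    over hereditary weights w.
    """
    t2 = theta[node] ** 2
    left, right = 2 * node + 1, 2 * node + 2
    if left >= len(theta):                       # a leaf: keep iff it pays
        return (lam, t2, {node}) if lam < t2 else (t2, t2, set())
    v0, ss0, k0 = prune(theta, lam, left)
    v1, ss1, k1 = prune(theta, lam, right)
    ss = t2 + ss0 + ss1                          # cost of killing the subtree
    keep_val = lam + v0 + v1                     # w_node = 1, optimal below
    if keep_val < ss:
        return keep_val, ss, {node} | k0 | k1
    return ss, ss, set()                         # heredity: kill everything below
```

For example, with coefficients `[10, 0.1, 5, 0.1, 0.1, 6, 0.1]` and $\lambda = 1$, the kept nodes are the root, its right child, and that child's large descendant; the left subtree, being all small, is pruned away wholesale.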

6.2 Best Basis for Hereditary Reconstruction

We can define the quality of a basis for hereditary reconstruction by considering the optimum value of the CPRSS functional over all hereditary reconstructions in that basis. Hence, define the Hereditary entropy

$$H_\lambda(B) = \min_{\mathcal{T} \subset \mathcal{T}^*} \mathrm{CPRSS}(\mathcal{T}; \lambda, B). \tag{6.4}$$

A Best Basis for hereditary reconstruction is then the solution of

$$\min_{B \in \mathcal{L}} H_\lambda(B), \tag{6.5}$$


where $\mathcal{L}$ is a library of orthogonal bases. This may be motivated in two ways. First, the goal is intrinsically reasonable, as it seeks a best tradeoff, over all bases and all subtrees, of complexity $\#(\mathcal{T})$ against fidelity to the data $\|y - \hat f_{B,\mathcal{T}}\|_2^2$. Second, we will prove below that the reconstruction obtained in the optimum basis has a near-ideal mean-squared error.

6.3 Fast Algorithm via CART

The "entropy" $H_\lambda(B)$ is not an additive functional $\sum_{i=1}^N e(\theta_i(y, B))$ of the coordinates of $y$ in basis $B$. Therefore the best-basis algorithm of Section 3, strictly speaking, does not apply. Luckily, we can use the fast CART algorithm. By now this is obvious; we summarize the fact formally, though without writing out the proof.

Theorem 6.2 When $\lambda$ is the same in both, CART and BOB with hereditary constraints have the same answers. More precisely,

$$\min_{B \in \mathcal{L}} H_\lambda(B) = \min_{\mathcal{P}} \mathrm{CPRSS}(\mathcal{P}, \lambda). \tag{6.6}$$

The solution of the best-basis problem (6.5) gives, explicitly, an anisotropic basis $\hat B$ and, implicitly by (6.3), an optimal subtree $\hat{\mathcal{T}}$; the solution of the CART problem (2.4) gives an optimizing partition $\hat{\mathcal{P}}$, and we have

$$\hat f_{\hat B, \hat{\mathcal{T}}}(\cdot) = \hat f(\cdot \mid \hat{\mathcal{P}}).$$

Remark 1. Although $H_\lambda(B)$ is not additive, a fast algorithm for computing it is available: the dyadic CART algorithm of Section 2! This shows that fast Best Basis algorithms may exist for certain non-additive entropies.

Remark 2. Although CART and Best-Ortho-Basis are not the same in general, in this case, with a specific set of definitions of Best Ortho Basis and a specific set of restrictions on the splits employed by CART, the two methods are the same.

7 Oracle Inequalities

CART and BOB define objects which are the solutions of certain optimization problems and hence are in some sense "optimal". However, we should stress that they are optimal only in the very artificial sense that they solve certain optimization problems we have defined.

We now turn to the question of performance according to externally defined standards, which will lead ultimately to a proof of our main result. This will entail a certain kind of "near-optimality" with a more significant and useful meaning.

In accordance with the philosophy laid out in [9], we approach this from two points of view. First, there is a statistical decision theory component of the problem, which we deal with in this section; second, there is a harmonic analysis component of the problem, which we deal with in the following section.

7.1 Oracle Inequalities

Once more we are in the model (2.1) and we wish to recover $f$ with small mean-squared error. We evaluate an estimator $\hat f = \hat f(y)$ by its risk
$$R(\hat f; f) = E\|\hat f(y) - f\|_2^2.$$


Suppose we have a collection of estimators $\mathcal{E} = \{\hat f(\cdot)\}$; we wish to use the one best adapted to the problem at hand. The best performance we can hope for is what Donoho and Johnstone (1994) call the ideal risk
$$R^*(\mathcal{E}; f) = \inf\{R(\hat f; f) : \hat f \in \mathcal{E}\}.$$
We call this ideal because it can be attained only with an oracle, who in full knowledge of the underlying $f$ (but not revealing this to us) selects the best estimator for this $f$ from the collection $\mathcal{E}$.

We optimistically propose $R^*(\mathcal{E}; f)$ as a target, and seek true estimators which can approach this target. It turns out that in several examples one can find estimators which achieve this to within logarithmic terms. The inequalities which establish this are of the form
$$R(\hat f^*; f) \le Const \cdot \log(N) \cdot (\sigma^2 + R^*(\mathcal{E}; f)) \quad \forall f,$$
which Donoho and Johnstone [10, 11, 12] call oracle inequalities, because they compare the risk of valid procedures with the risk achievable by idealized procedures which depend on oracles.

7.2 Example: Keep-or-Kill De-Noising

Suppose we are operating in a fixed orthogonal basis $B$ and consider the family $\mathcal{E}$ of estimators defined by keeping or killing empirical coefficients in the basis $B$. Such estimators $\hat f(y; w)$ are given in the basis $B$ by
$$\theta_i(\hat f; B) = w_i \cdot \theta_i(y; B), \quad i = 1, \ldots, N,$$
where each weight $w_i$ is either 0 or 1. Such estimators have long been considered in the context of Fourier series estimation, where the basis is the Fourier basis, the coefficients are Fourier coefficients, and the $w_i$ are 1 only for $1 \le i \le k$, for some frequency cutoff $k$. Estimators of this form have also been considered by Donoho and Johnstone in the context where $B$ is a wavelet basis [10]; in that setting the unit weights are ideally chosen at sites of important spatial variability.

Formally, then, $\mathcal{E} = \{\hat f(\cdot; w) : w \in \{0,1\}^N\}$ is the collection of all keep-or-kill estimators in the fixed basis $B$. In [10], Donoho and Johnstone studied the nonlinear estimator $\hat f^*$, defined in the basis $B$ by hard thresholding
$$\theta_i(\hat f^*(y); B) = \eta_{\sqrt{\lambda_N}}(\theta_i(y; B)), \quad i = 1, \ldots, N,$$
where $\eta_t(y) = 1_{\{|y|>t\}}(y) \cdot |y| \cdot \mathrm{sgn}(y)$ is the hard thresholding nonlinearity and $\lambda_N = \sigma^2 \cdot 2\log(N)$. They showed that $\hat f^*$ obeys the oracle inequality
$$R(\hat f^*; f) \le 2 \cdot \log(N) \cdot (\sigma^2 + R^*(\mathcal{E}; f)) \quad \forall f,$$
as soon as $N \ge 4$. In short, simple thresholding comes within log-terms of ideal keep-or-kill behavior.

The reader will find it instructive to note that the estimator $\hat f^*$ can also be defined as the solution of the optimization problem
$$\min_{w \in \{0,1\}^N} \|y - \hat f(y; w)\|^2 + \lambda_N \cdot \#\{i : w_i \ne 0\}.$$
This is, of course, a complexity-penalized RSS, with penalty term $\lambda_N$. Thus the near-ideal estimator is the solution of a minimum-CPRSS principle.
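The coordinatewise decoupling just described is easy to check numerically. Here is a minimal sketch (our own notation, not the paper's code): in an orthogonal basis the penalized RSS splits across coordinates, where keeping coordinate $i$ costs $\lambda_N$ and killing it costs $\theta_i^2$, so hard thresholding at $\sqrt{\lambda_N}$ produces exactly the CPRSS minimizer.

```python
import numpy as np

def hard_threshold(theta, t):
    """Hard thresholding: keep a coefficient iff its magnitude exceeds t."""
    return theta * (np.abs(theta) > t)

def penalized_keep_or_kill(theta, lam):
    """Coordinatewise solution of min_w ||theta - w*theta||^2 + lam * #{w_i != 0}:
    keeping coordinate i costs lam, killing it costs theta_i^2."""
    w = (theta ** 2 > lam).astype(float)
    return w * theta

rng = np.random.default_rng(0)
N, sigma = 1024, 1.0
theta = rng.normal(0, sigma, N) + 5.0 * (rng.random(N) < 0.05)  # sparse signal + noise
lam_N = sigma ** 2 * 2 * np.log(N)

# Hard thresholding at sqrt(lam_N) coincides with the CPRSS minimizer.
a = hard_threshold(theta, np.sqrt(lam_N))
b = penalized_keep_or_kill(theta, lam_N)
assert np.allclose(a, b)
```

The agreement holds for any data vector (up to ties of measure zero), which is the sense in which thresholding "is" a minimum-CPRSS principle.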


7.3 Example: Best-Basis De-Noising

Suppose now we are operating in a library $\mathcal{L}$ of orthogonal bases $B$ and consider the family $\mathcal{E}$ of estimators defined by keeping or killing empirical coefficients in some basis $B \in \mathcal{L}$. Such estimators $\hat f(y; w, B)$ are of the form
$$\theta_i(\hat f; B) = w_i \cdot \theta_i(y; B), \quad i = 1, \ldots, N,$$
where each weight $w_i$ is either 0 or 1.

Formally, $\mathcal{E} = \{\hat f(\cdot; w, B) : w \in \{0,1\}^N, B \in \mathcal{L}\}$. For obvious reasons, we also write $R^*(\mathcal{E}; f)$ as $R^*(\text{Ideal Basis}; f)$. In [11], Donoho and Johnstone developed a nonlinear estimator $\hat f^*$ with near-ideal properties; it is precisely the best-basis de-noising estimator defined in Section 4; see (4.1)-(4.3). In detail, they supposed that among all bases in the library there are at most $M$ distinct elements. They suppose that we pick $\gamma > 8$ and set $t = \sqrt{2\log_e(M)}$; then, with $\lambda = (\gamma \cdot (1+t))^2$, they prove a result almost as strong as the following, which we prove in the appendix below.

Theorem 7.1 For an appropriate constant $A(\gamma)$, the BOB estimator obeys the oracle inequality
$$R(\hat f^*; f) \le A(\gamma) \cdot \lambda \cdot (\sigma^2 + R^*(\mathcal{E}; f)) \quad \forall f. \qquad (7.1)$$
In short, empirical best basis (with an appropriate entropy) comes within log-terms of ideal keep-or-kill behavior in an ideal basis. In the specific case of the library of anisotropic Haar bases, $M = 4N$, and so for a fixed choice of $\gamma$, (7.1) becomes
$$R(\hat f^*; f) \le Const \cdot \log(N) \cdot (\sigma^2 + R^*(\text{Ideal Basis}; f)) \quad \forall f.$$

7.4 Example: CART

Oracle inequalities for CART are now easy to state. Suppose now we are operating in the library $\mathcal{L}$ of anisotropic Haar bases and consider the family $\mathcal{E}_{Tree}$ of hereditary linear estimators, i.e. estimators defined by keeping or killing the empirical coefficients in some basis $B \in \mathcal{L}$, where the coefficients that are kept fall in a tree pattern $T$. Such estimators $\hat f(y; T, B)$ are of the form
$$\theta_i(\hat f; B) = w_i \cdot \theta_i(y; B), \quad i = 1, \ldots, N,$$
where each weight $w_i$ is either 0 or 1, and the nonzero $w$ form a tree.

Formally, let $\mathcal{E}_{Tree} = \{\hat f(\cdot; T, B) : T \subset \mathcal{T}, B \in \mathcal{L}\}$ be the collection of all hereditary linear estimators in any anisotropic Haar basis. The ideal risk $R^*(\mathcal{E}_{Tree}; f)$ is just the risk of CART applied with an ideal partition $P$ selected by an oracle. So call this $R^*(\text{Ideal CART}; f)$.

Consider now the dyadic CART estimator $\hat f^*$ defined with $\lambda$ exactly as specified in the best-basis de-noising setting of the last subsection. So for $\gamma > 8$, set $\lambda = (\gamma \cdot (1 + \sqrt{2\log_e(4N)}))^2$. We prove the following in the appendix.

Theorem 7.2 For all $N \ge 1$, the dyadic CART estimator obeys the oracle inequality
$$R(\hat f^{*,\lambda}; f) \le Const \cdot \log(N) \cdot (\sigma^2 + R^*(\text{Ideal CART}; f)) \quad \forall f.$$
In short, empirical dyadic CART (with an appropriate entropy) comes within log-terms of ideal dyadic CART.


8 Mixed Smoothness Spaces

We now change gears slightly and consider harmonic analysis questions. Specifically, we are going to show that anisotropic Haar bases are particularly well adapted to dealing with classes of functions having mixed smoothness.

We denote now by $f$ a function $f(x,y)$ defined on $[0,1]^2$, rather than an array of pixel values. We consider objects of different smoothnesses in different directions; the specific notion of smoothness we use is based on what Temlyakov calls Nikol'skii classes [17]. Define the finite difference operators $(D^1_h f)(x,y) = f(x+h, y) - f(x,y)$ and $(D^2_h f)(x,y) = f(x, y+h) - f(x,y)$. For $\delta_1, \delta_2$ satisfying $0 < \delta_i \le 1$, define the mixed smoothness class
$$F^{\delta_1,\delta_2}_p(C) = \{f : \|f\|_p \le C;\ \|D^1_h f\|_{L_p(Q^1_h)} \le C h^{\delta_1},\ h \in (0,1);\ \|D^2_h f\|_{L_p(Q^2_h)} \le C h^{\delta_2},\ h \in (0,1)\},$$
where $Q^1_h = [0, 1-h) \times [0,1]$ and $Q^2_h = [0,1] \times [0, 1-h)$. This contains objects of genuinely mixed smoothness whenever $\delta_1 \ne \delta_2$. The usual smoothness spaces (Hölder, Sobolev, Triebel, ...) involve equal degrees of smoothness in different directions and are sometimes called "isotropic," so that classes like $F^{\delta_1,\delta_2}_p(C)$ would be called "anisotropic."

8.1 Spatially Uniform Anisotropic Bases

Definition 8.1 A sequential partitioning of $j$ into two parts is a pair of series of integers $j_1(j), j_2(j)$, $j = 0, 1, 2, \ldots$, obeying

- Initialization. $j_1(0) = j_2(0) = 0$.

- Partition. $j_1(j) + j_2(j) = j$.

- Sequential Allocation. $j_1(j) = j_1(j-1) + b_1(j)$, $b_1 \in \{0,1\}$; $j_2(j) = j_2(j-1) + b_2(j)$, $b_2 = 1 - b_1$.

We can think of two boxes and a sequential scheme where at each stage we put a ball in one of the two boxes. Here $j_i(j)$ represents the number of balls in box $i$ at stage $j$, and $b_1 = 1 - b_2$ represents the constraint that only one ball is put into the boxes at each stage.

Definition 8.2 Consider a sequential partition of $j$ into two parts. The spatially uniform alternating partition subordinate to this partition, denoted SUAP$(j_1, j_2)$, is a complete dyadic recursive partition formed in a homogeneous fashion: at Stage 1, the square $[0,1]^2$ is split vertically if $b_1(1) = 1$ and horizontally if $b_2(1) = 1$; at Stage 2, each of the two resulting rectangles is split in two, vertically if $b_1(2) = 1$, horizontally if $b_2(2) = 1$; and at Stage $j$, each of the $2^{j-1}$ rectangles of area $2^{-j+1}$ formed at the previous stage is split vertically if $b_1(j) = 1$, horizontally if $b_2(j) = 1$.

The recursive partition SUAP$(j_1, j_2)$ defines a series of collections of rectangles as follows. $\mathcal{R}(0)$ consists of the root rectangle, $\mathcal{R}(1)$ consists of the two children of the root, $\mathcal{R}(2)$ of the four children of the rectangles in $\mathcal{R}(1)$, etc. In general, $\mathcal{R}(j)$ consists of $2^j$ rectangles of area $2^{-j}$ each.


This sequence of rectangles defines an orthogonal basis of $L^2([0,1]^2)$ in a fashion similar to the discrete case, with fairly obvious changes due to the change in setting. Let now $I$ denote a dyadic subinterval of $[0,1]$ and let $\tilde\varphi_I(x)$ be the "same function" as $\varphi_I(i)$, under the correspondence $x_i \leftrightarrow i/n$ and under the different choice of normalizing measure: $\tilde\varphi_I(x) = 1_I(x) \cdot \ell(I)^{-1/2}$, where $\ell(I)$ denotes the length of $I$. Similarly, let $\tilde h_I(x) = (1_{I_1}(x) - 1_{I_0}(x)) \cdot \ell(I)^{-1/2}$. Then set $\varphi^1_R = \tilde h_{I_1}(x)\tilde\varphi_{I_2}(y)$ and $\varphi^2_R = \tilde\varphi_{I_1}(x)\tilde h_{I_2}(y)$. Then set
$$\varphi_0 = 1_{[0,1]^2}; \quad \psi_{0,[0,1]^2} = \varphi^{s(1)}_{[0,1]^2}; \quad \psi_{1,R} = \varphi^{s(2)}_R \text{ for } R \in \mathcal{R}(1); \quad \psi_{2,R} = \varphi^{s(3)}_R \text{ for } R \in \mathcal{R}(2);$$
and in general
$$\psi_{j,R} = \varphi^{s(j+1)}_R \quad \text{for } R \in \mathcal{R}(j), \qquad (8.1)$$
where $s(j) \in \{1,2\}$ denotes the direction of the split made at stage $j$; call this the spatially homogeneous anisotropic basis SHAB$(j_1, j_2)$. The coefficients of $f$ in this basis are
$$\beta_f = \mathrm{Ave}_{[0,1]^2}(f); \qquad \beta_R = \langle \psi_{j,R}, f \rangle, \quad R \in \mathcal{R}(j). \qquad (8.2)$$

8.2 Best Basis for a functional class

Donoho (1993) described a notion of "best orthogonal basis" for a functional class $\mathcal{F}$, which describes the kinds of bases in which certain kinds of de-noising and data compression can take place. For this notion, the best basis for a functional class $\mathcal{F}$ is the basis in which the coefficients of members of $\mathcal{F}$ decay fastest.

For a vector $\theta$ in sequence space, let $|\theta|_{(k)}$ denote the rearranged magnitudes of the coefficients, sorted in decreasing order: $|\theta|_{(1)} \ge |\theta|_{(2)} \ge \ldots$. The weak $\ell^\tau$ norm measures the decay of these by
$$\|\theta\|_{w\ell^\tau} = \sup_{k \ge 1} k^{1/\tau} |\theta|_{(k)}.$$
This measures decay of the coefficients since $\|\theta\|_{w\ell^\tau} \le C$ implies $|\theta|_{(k)} \le C k^{-1/\tau}$, $k = 1, 2, \ldots$.
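The weak-norm definition is mechanical enough to compute directly. A minimal sketch (our notation): sort the magnitudes, multiply by $k^{1/\tau}$, and take the supremum; the resulting constant then dominates the decreasing rearrangement as claimed.

```python
import numpy as np

def weak_lp_norm(theta, tau):
    """Weak-l_tau norm: sup_k k^{1/tau} |theta|_(k), where |theta|_(k) are
    the coefficient magnitudes sorted in decreasing order."""
    mags = np.sort(np.abs(np.asarray(theta, dtype=float)))[::-1]
    k = np.arange(1, len(mags) + 1)
    return float(np.max(k ** (1.0 / tau) * mags))

# If ||theta||_{w l_tau} <= C, then |theta|_(k) <= C k^{-1/tau}:
theta = 1.0 / np.arange(1, 101) ** 2          # decays like k^{-2}, i.e. weak-l_{1/2}
C = weak_lp_norm(theta, 0.5)
mags = np.sort(np.abs(theta))[::-1]
ks = np.arange(1, 101)
assert np.all(mags <= C * ks ** (-1.0 / 0.5) + 1e-12)
```

Smaller $\tau$ corresponds to faster decay, which is why a best basis is one minimizing the critical exponent defined next.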

Now, with $\theta = (\theta_i(f, B))$ the coefficients of $f$ in an orthogonal basis $B$, a given functional class $\mathcal{F}$ maps to a coefficient body $\Theta(\mathcal{F}; B) = \{(\theta_i(f, B))_i : f \in \mathcal{F}\}$. For such a set $\Theta$, we say $\Theta \subset w\ell^\tau$ if
$$\sup\{\|\theta\|_{w\ell^\tau} : \theta \in \Theta\} < \infty.$$

Definition 8.3 We call the critical exponent of $\Theta$ the number $\tau^*(\Theta)$ obtained as the infimum of all $\tau$ for which $\Theta \subset w\ell^\tau$.

Assuming that $\mathcal{F}$ is a subset of $L^2$, $0 \le \tau^*(\Theta) \le 2$. From this point of view, a "best basis" for $\mathcal{F}$ is any basis $B^*$ which minimizes the critical exponent:
$$\tau^*(\Theta(\mathcal{F}, B^*)) = \min_B \tau^*(\Theta(\mathcal{F}, B)). \qquad (8.3)$$
In such a basis the rearranged coefficients will be the most rapidly decaying, among all orthogonal bases.


8.3 Best Anisotropic Bases

With this background, it is interesting to ask about the decay properties of coefficients in different spatially homogeneous anisotropic bases. We might hope to identify a basis $B$ within the class of anisotropic Haar bases satisfying (8.3) among all bases. In fact we can. The key fact is this upper bound.

Lemma 8.4 Let $f \in F^{\delta_1,\delta_2}_p(C)$. If $b_1(j) = 1$,
$$\Big(\sum_{R \in \mathcal{R}(j)} |\beta_R|^p\Big)^{1/p} \le C \cdot 2^{-j_1\delta_1} \cdot (2^{-j})^{1/2 - 1/p}, \qquad (8.4)$$
while if $b_2(j) = 1$,
$$\Big(\sum_{R \in \mathcal{R}(j)} |\beta_R|^p\Big)^{1/p} \le C \cdot 2^{-j_2\delta_2} \cdot (2^{-j})^{1/2 - 1/p}. \qquad (8.5)$$

Now a choice of spatially uniform anisotropic partition which would make optimal use of these expressions as a function of $j$ would arrange things so that the larger of the two expressions decreased fastest in $j$. Thus optimal use of Lemma 8.4 leads to the problem of constructing a sequential partition of $j$ into parts that optimizes the rate of decay of
$$\max(2^{-j_1(j)\delta_1}, 2^{-j_2(j)\delta_2}) \qquad (8.6)$$
as a function of $j$.

There is an obvious limit on how well this can be done. Consider optimizing (8.6) subject only to the constraints $j_1(j) + j_2(j) = j$ and $j_i \ge 0$, i.e. without imposing the requirement that the $j_i$ be integers or be sequentially chosen. The solution is $j_1(j) = \delta_2/(\delta_1+\delta_2) \cdot j$ and $j_2(j) = \delta_1/(\delta_1+\delta_2) \cdot j$, achieving an optimally small value of
$$2^{-j\delta}, \qquad \delta = \frac{\delta_1\delta_2}{\delta_1 + \delta_2}, \qquad (8.7)$$
in (8.6). We cannot hope to do better than this once we re-impose the constraints associated with a sequential partition. But we can come close.

Definition 8.5 For a given pair of "exponents" $\delta_1, \delta_2$ obeying $0 < \delta_i \le 1$, we call an optimal sequential partition of $j$ a sequential partitioning $(j^*_1, j^*_2)$ obtained as follows:

1. Start from $j^*_1(0) = j^*_2(0) = 0$.

2. At stage $j$, allocate $b_1(j)$ and $b_2(j)$ as follows.

   a. If $j^*_1(j-1)\delta_1 = j^*_2(j-1)\delta_2$, "allocate the ball to whichever box has the smaller exponent": $b_1(j) = 1$ if $\delta_1 < \delta_2$.

   b. If $j^*_1(j-1)\delta_1 \ne j^*_2(j-1)\delta_2$, "allocate the ball to whichever box has the smaller product" $j^*_i(j-1)\delta_i$.

This so-called optimal sequential partitioning of $j$ is a greedy stepwise minimization of objective (8.6). It turns out that it is near-optimal, even among nonsequential partitions.
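The greedy rule is short enough to state in code. Here is an illustrative sketch (our notation, not the paper's): allocate each ball to the box with the smaller current product $j_i\delta_i$, and check numerically that the greedy allocation satisfies the factor-2 optimality of Lemma 8.6 below.

```python
def optimal_sequential_partition(delta1, delta2, J):
    """Greedy allocation of Definition 8.5: at each stage put the ball in the
    box with the smaller current product j_i * delta_i, breaking ties toward
    the smaller exponent. Returns the sequence of (j1, j2) pairs."""
    j1 = j2 = 0
    history = []
    for _ in range(J):
        p1, p2 = j1 * delta1, j2 * delta2
        if p1 == p2:
            take1 = delta1 <= delta2      # tie: smaller exponent (box 1 on equality)
        else:
            take1 = p1 < p2
        if take1:
            j1 += 1
        else:
            j2 += 1
        history.append((j1, j2))
    return history

# Factor-2 optimality: max(2^{-j1 d1}, 2^{-j2 d2}) <= 2 * 2^{-j delta},
# with delta = d1 d2 / (d1 + d2).
d1, d2 = 0.3, 0.9
delta = d1 * d2 / (d1 + d2)
for j, (j1, j2) in enumerate(optimal_sequential_partition(d1, d2, 40), start=1):
    assert j1 + j2 == j
    assert max(2 ** (-j1 * d1), 2 ** (-j2 * d2)) <= 2 * 2 ** (-j * delta) + 1e-12
```

The reason the check passes is the invariant behind Lemma 8.6: the two products never differ by more than $\max(\delta_1,\delta_2) \le 1$, while their maximum is always at least $j\delta$.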


Lemma 8.6 For $0 < \delta_1, \delta_2 \le 1$, and $\delta = \frac{\delta_1\delta_2}{\delta_1+\delta_2}$,
$$\max(2^{-j^*_1(j)\delta_1}, 2^{-j^*_2(j)\delta_2}) \le 2 \cdot 2^{-j\delta}. \qquad (8.8)$$

Due to (8.7), this is essentially optimal within the class of sequential partitions of $j$. Inspired by this, we propose the following

Definition 8.7 We call Best Anisotropic Basis BAB$(\delta_1, \delta_2)$ the anisotropic basis of $L^2[0,1]^2$ defined using SHAB$(j^*_1, j^*_2)$.

Combining this Lemma with Lemma 8.4 above gives

Corollary 8.8 If we use BAB$(\delta_1, \delta_2)$, then for $\delta = \frac{\delta_1\delta_2}{\delta_1+\delta_2}$,
$$\Big(\sum_{R \in \mathcal{R}(j)} |\beta_R|^p\Big)^{1/p} \le 2 \cdot C \cdot 2^{-j(\delta + 1/2 - 1/p)}. \qquad (8.9)$$

8.4 Optimality of BAB

Armed with Corollary 8.8, it is possible to justify Definition 8.7 and prove that BAB$(\delta_1, \delta_2)$ is an optimal basis in the sense of Section 8.2.

Theorem 8.9 Let $\mathcal{L}$ denote the collection of all orthogonal bases for $L^2[0,1]^2$. Then
$$\tau^*(\Theta(F^{\delta_1,\delta_2}_p(C), BAB(\delta_1,\delta_2))) = \min_{B \in \mathcal{L}} \tau^*(\Theta(F^{\delta_1,\delta_2}_p(C), B)). \qquad (8.10)$$

The proof is a consequence of three lemmas, all of which are proved in the appendix. The first gives an evaluation of the critical exponent for BAB$(\delta_1, \delta_2)$.

Lemma 8.10
$$\tau^*(\Theta(F^{\delta_1,\delta_2}_p(C), BAB(\delta_1,\delta_2))) = 2/(2\delta + 1). \qquad (8.11)$$

The optimality of this exponent follows from a lower bound technique developed at greater length in [8]. First, a definition: an orthogonal hypercube $H$ of dimension $m$ and side $\epsilon$ is the collection of all sums $g_0 + \sum_{i=1}^m \xi_i g_i$, where the $g_i$ are orthonormal functions and the $|\xi_i| \le \epsilon$.

Lemma 8.11 Suppose $\mathcal{F}$ contains a sequence $H_j$ of orthogonal hypercubes of dimension $m_j$ and side $\epsilon_j$, where $\epsilon_j \to 0$, $m_j \to \infty$, and
$$m_j^{1/\tau} \epsilon_j \ge c_0 > 0.$$
Let $\mathcal{L}$ denote any collection of orthogonal bases. Then
$$\inf_{B \in \mathcal{L}} \tau^*(\Theta(\mathcal{F}; B)) \ge \tau.$$

Lemma 8.12 Each class $F^{\delta_1,\delta_2}_p(C)$ contains a sequence $H_j$ of orthogonal hypercubes of dimension $m_j = 2^j$ and side $\epsilon_j$, where $\epsilon_j \to 0$, $m_j \to \infty$, and
$$m_j^{1/\tau} \epsilon_j \ge K \cdot C,$$
with $K$ a fixed constant and $\tau = 2/(2\delta + 1)$.


9 Near-Minimaxity of BOB

As a result of the harmonic analysis in Section 8 and the ideas in [8], we know that BAB$(\delta_1,\delta_2)$ is the best basis in which to apply ideal keep-or-kill estimates. This is the key stepping stone to our main result.

In this section we show that the risk for ideal keep-or-kill in BAB$(\delta_1,\delta_2)$ is within constants of the minimax risk over each $F^{\delta_1,\delta_2}_p(C)$. From the oracle inequality of Section 7.3, we know that empirical basis selection, as in (4.1)-(4.3), which empirically selects a basis and applies thresholding within it, will always be nearly as good as ideal keep-or-kill in BAB$(\delta_1,\delta_2)$, even though it makes no assumptions on $\delta_1$ or $\delta_2$. This means that empirical best-basis de-noising obeys a near-minimaxity result like Theorem 1.1.

Theorem 9.1 Best-Basis De-Noising, defined in Section 4 above, with $\lambda$ defined as in Section 7.3 above, comes within logarithmic factors of minimax over each functional class $F^{\delta_1,\delta_2}_p(C)$, $0 < \delta_1, \delta_2 \le 1$, $C > 0$, $p \in (0,\infty)$. If $\hat f^{*,\lambda}$ denotes the best-basis de-noising estimator, then
$$\sup_{\mathcal{F}} E\|\hat f^{*,\lambda} - \bar f\|^2_{\ell^2} \le Const(\delta_1,\delta_2,p) \cdot \log(n) \cdot M^*(\sigma, n; \mathcal{F}) \quad \text{as } n \to \infty \qquad (9.1)$$
for each $\mathcal{F} \in \mathcal{MS}$.

The key arguments to prove Theorem 9.1 are given in Sections 9.1 and 9.4 below. Our main result, Theorem 1.1, will be proved in Section 10 by using some of those results a second time.

9.1 Lower Bound on the Minimax Risk

We first study the minimax risk and show that it obeys the lower bound
$$M^*(\sigma, n; F^{\delta_1,\delta_2}_p(C)) \ge K(\delta_1,\delta_2) \cdot (C^2)^{1-r}(\epsilon^2)^r \quad \text{as } \epsilon = \sigma/\sqrt{N} \to 0, \qquad (9.2)$$
where $r = \frac{2\delta}{2\delta+1}$.

We use the method of cubical subproblems. Modified definition: in this section, by orthogonal hypercube $H$ of dimension $m$ and side $\epsilon$ we mean the collection of all sums $g_0 + \sum_{k=1}^m \xi_k g_k$, where the $g_k = g_k(i_1, i_2)$ are $n$ by $n$ arrays, orthonormal with respect to the specially normalized $\ell^2_N$ norm,
$$\frac{1}{N}\sum_{i_1,i_2} g_k(i_1,i_2)\,g_{k'}(i_1,i_2) = 1_{\{k=k'\}},$$
and all the $|\xi_k| \le \epsilon$. The following Lemma may be proved as in [12].

Lemma 9.2 Let $\epsilon = \sigma/\sqrt{N}$. Suppose a class $\mathcal{F}$ contains an orthogonal hypercube of sidelength $\bar\epsilon$ with $\epsilon \le \bar\epsilon < \frac{11}{10}\epsilon$ and dimension $m(\bar\epsilon)$. Then, for an absolute constant $A > 1/10$,
$$M^*(\sigma, n; \mathcal{F}) \ge A \cdot m(\bar\epsilon) \cdot \epsilon^2. \qquad (9.3)$$

To make effective use of this, we seek cubes of sufficiently high dimension and prescribed sidelength. The following lemma is proved in the Appendix.


Lemma 9.3 Let $\epsilon = \sigma/\sqrt{N}$. Each class $F^{\delta_1,\delta_2}_p(C)$ contains orthogonal hypercubes (orthogonal with respect to the $\ell^2_N$ norm) of sidelength $\bar\epsilon = \epsilon \cdot (1 + o(1))$ and dimension $m(\epsilon, C)$, where
$$m(\epsilon, C) \ge K(\delta_1,\delta_2) \cdot (C/\epsilon)^{\frac{2}{2\delta+1}}, \quad 0 < \epsilon < \epsilon_0, \qquad (9.4)$$
and
$$\delta = \frac{\delta_1 \delta_2}{\delta_1 + \delta_2}. \qquad (9.5)$$

Combining these two lemmas gives the lower bound (9.2).
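Spelled out, the combination is a routine substitution of (9.4) into (9.3):

```latex
M^*(\sigma,n;F^{\delta_1,\delta_2}_p(C))
  \;\ge\; A \cdot m(\epsilon, C)\cdot \epsilon^2
  \;\ge\; A \, K(\delta_1,\delta_2)\,(C/\epsilon)^{\frac{2}{2\delta+1}}\,\epsilon^2
  \;=\; A\,K \cdot C^{\frac{2}{2\delta+1}}\,\epsilon^{\,2-\frac{2}{2\delta+1}}
  \;=\; K'(\delta_1,\delta_2)\,(C^2)^{1-r}\,(\epsilon^2)^{r},
```

since $1 - r = \frac{1}{2\delta+1}$ and $2 - \frac{2}{2\delta+1} = \frac{4\delta}{2\delta+1} = 2r$.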

9.2 Equivalent Estimation Problems

Sections 2-7 of this paper work in a setting of $n$ by $n$ arrays. Section 8 works in a setting of functions on the continuum unit square. Theorem 9.1 is based on a combination of both points of view.

From the viewpoint of Sections 2-7, one would naturally consider applying CART and BOB estimators to data $y_i$, $i \in [0,n)^2$. Suppose instead that we define the rescaled data
$$\tilde y_i = N^{-1/2} y_i, \quad i \in [0,n)^2,$$
and also define $\epsilon = \sigma/n = \sigma/\sqrt{N}$. The results we get in applying (appropriately calibrated) CART or BOB to such data are (obviously) proportional to the results we get in applying the same techniques to the unscaled data.

There is a connection between these rescaled data and data about the function $f$ on the continuum square. Let $R$ denote both a dyadic rectangle of $[0,n)^2$ and the "same" rectangle on the continuum square $[0,1]^2$. Recall that $\varphi^1_R(x,y)$ denotes a function on the continuum square $[0,1]^2$ normalized to $L^2[0,1]^2$-norm 1, and $\varphi^1_R(i_1,i_2) = h_{I_1}(i_1)\varphi_{I_2}(i_2)$ is the "same" function, only on the grid $[0,n)^2$ and normalized to $\ell^2_N$ norm 1. Then
$$\langle \tilde y, \varphi^1_R \rangle_{\ell^2(N)} = \langle f, \varphi^1_R \rangle_{L^2[0,1]^2} + \epsilon \cdot z^1_R,$$
where the $z^1_R$ are $N(0,1)$, and independent for rectangles which are disjoint. Similar relationships hold for $\varphi^2_R$ on the two domains.

Hence the discrete-basis analysis of the rescaled data $\tilde y_i$ has the interpretation of giving noisy measurements of the continuum coefficients of $f$, and vice versa. Moreover, suppose that $P^*$ is a complete dyadic recursive partition of the discrete grid $[0,n)^2$ and we consider only the coefficients attached to rectangles in the nonterminal nodes of this partition. The partial reconstruction of $f$ from just those coefficients is simply the collection of $f$'s pixel-level averages; formally, if we put
$$\bar f(i_1,i_2) = \sum_{R \in NT(P^*)} \beta_R \,\varphi^{s(R)}_R(i_1,i_2)$$
and
$$\bar f(x,y) = \sum_{R \in NT(P^*)} \beta_R \,\varphi^{s(R)}_R(x,y),$$
then $\bar f(x,y)$ takes the value $\bar f(i_1,i_2)$ throughout the rectangle $[i_1/n, (i_1+1)/n) \times [i_2/n, (i_2+1)/n)$.


Consider the problem of estimating $(\beta^{s(R)}_R)_{R \in NT(P^*)}$ from the noisy data $\langle f, \varphi^{s(R)}_R \rangle_{L^2[0,1]^2} + \epsilon \cdot z^{s(R)}_R$. By Parseval, the squared $\ell^2$ risk obeys
$$\epsilon^2 + E \sum_{R \in NT(P^*)} \big(\hat\beta^{s(R)}_R - \beta^{s(R)}_R\big)^2 = N^{-1} \cdot E\|\hat f - \bar f\|^2_{\ell^2(N)}, \qquad (9.6)$$
and so the mean-squared error in the coefficient domain gives us the mean-squared error for recovery of pixel-level averages in the other domain.

9.3 Discrete and Continuous Partitionings

Consider now BAB$(\delta_1,\delta_2)$ for a given $(\delta_1,\delta_2)$ pair. This corresponds to an infinite sequence of families $\mathcal{R}(j)$, each family partitioning the continuum square $[0,1]^2$ by congruent rectangles of area $2^{-j}$.

Such a sequence of partitions usually does not have, for its initial segment of $\log_2(N)$ members, an interpretation as a sequence of partitions of the discrete square $0 \le i_1, i_2 < n$. A sequence of partitions of the discrete square has the additional constraint that, out of the first $\log_2(N)$ splits, exactly half must be vertical and half horizontal. Put another way, if we consider some BAB, those rectangles which are not too narrow in any direction (i.e. where each sidelength exceeds $1/n$) also correspond to rectangles in a complete dyadic recursive partition of the discrete square $[0,n)^2$. But there exist BAB (for example those with $\min(\delta_1,\delta_2)$ close to zero and $\max(\delta_1,\delta_2)$ close to one) which, at some level $j$ between $\log_2(n)$ and $\log_2(N)$, have already split in a certain direction more than $\log_2(n)$ times. Consequently, the continuum BAB is not quite available in the analysis of finite datasets.

On the other hand, in the analysis of finite datasets, there are available bases which achieve the same estimates of coefficient decay as in the continuum case.

Definition 9.4 For a given pair of exponents $(\delta_1, \delta_2)$ and whole number $J$, we call a Balanced Finite Optimal Sequential Partition an application of the optimal sequential partitioning rule of Definition 8.5, with two extra rules:

3. The process stops at stage $2J$. There are at most $2J$ "balls".

4. The process must preserve $j^*_i(j) \le J$. Once a certain "box" has $J$ "balls", all remaining allocations of "balls" are to the other "box".

Lemma 9.5 For $0 < \delta_1, \delta_2 \le 1$, let $j^*_i(j)$ denote the result of a Balanced Finite Optimal Sequential Partitioning, and let $\mathcal{R}^*(j)$ denote the associated collection of rectangles. With $\delta = \frac{\delta_1\delta_2}{\delta_1+\delta_2}$,
$$\Big(\sum_{R \in \mathcal{R}^*(j)} |\beta_R|^p\Big)^{1/p} \le 2 \cdot C \cdot 2^{-j(\delta + 1/2 - 1/p)}. \qquad (9.7)$$

The proof is simply to inspect the proof of Corollary 8.8 and notice that the constraint preventing allocation of "balls" to certain "boxes" means that in certain expressions one can replace terms like
$$\max(2^{-j^*_1(j)\delta_1}, 2^{-j^*_2(j)\delta_2})$$
by the even smaller
$$\min(2^{-j^*_1(j)\delta_1}, 2^{-j^*_2(j)\delta_2}).$$


9.4 Upper Bound on Ideal Risk

We now study the ideal risk and show that it obeys an upper bound similar in form to the lower bound of Section 9.1. Starting now, let BAB$^*(\delta_1,\delta_2)$ denote the modified basis described in the previous subsection.

Lemma 9.6 Let $R(\text{Keep-Kill}; f; \epsilon)$ be the ideal risk for keep-or-kill estimation in BAB$^*(\delta_1,\delta_2)$. Then with $r = \frac{2\delta}{2\delta+1}$,
$$\sup_{f \in F^{\delta_1,\delta_2}_p(C)} R(\text{Keep-Kill}; f; \epsilon) \le B(\delta_1,\delta_2,p) \cdot (C^2)^{1-r}(\epsilon^2)^r, \quad 0 < \epsilon < \epsilon_0. \qquad (9.8)$$

Proof. As in [13], consider the optimization problem
$$m_j(\epsilon, \gamma) = \max \|\theta\|^2_{\ell^2} \quad \text{subject to} \quad \|\theta\|_{\ell^\infty} \le \epsilon, \ \|\theta\|_{\ell^p} \le \gamma, \quad \theta \in \mathbf{R}^{2^j}.$$
By Parseval and (9.6), the best possible risk for a purely keep-or-kill estimate is $\epsilon^2 + \sum \min(\beta_R^2, \epsilon^2)$. Also, by Lemma 9.5, there are constants $\gamma_j = \gamma_j(C)$ so that for $f \in F^{\delta_1,\delta_2}_p(C)$,
$$\Big(\sum_{R \in \mathcal{R}^*(j)} |\beta_R|^p\Big)^{1/p} \le \gamma_j.$$
The largest risk of ideal keep-or-kill is thus
$$\max_{f \in F^{\delta_1,\delta_2}_p(C)} \sum_j \sum_{R \in \mathcal{R}^*(j)} \min(\beta_R^2, \epsilon^2) \le \max \sum_j \sum_{R \in \mathcal{R}^*(j)} \min(\beta_R^2, \epsilon^2) \ \text{ subj. to } \Big(\sum_{R \in \mathcal{R}^*(j)} |\beta_R|^p\Big)^{1/p} \le \gamma_j$$
$$= \sum_j m_j(\epsilon, \gamma_j).$$
Now [13] gives the explicit evaluation
$$m_j(\epsilon, \gamma) = \min(2^j \epsilon^2,\ \gamma^p \epsilon^{2-p},\ \gamma^2), \qquad (9.9)$$
and applying this, we have
$$\sum_j m_j(\epsilon, \gamma_j) \le (C^2)^{1-r}(\epsilon^2)^r \cdot K(\delta_1,\delta_2,p). \qquad \Box \qquad (9.10)$$

This is the risk of an ideal de-noising by a keep-or-kill estimator not obeying hereditary constraints.
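The scaling in (9.10) can be checked numerically. A minimal sketch (our notation): sum the explicit evaluation (9.9) over $j$ with $\gamma_j = 2C \cdot 2^{-j(\delta+1/2-1/p)}$ from Lemma 9.5, and verify that halving the noise level twice shrinks the sum by the predicted factor $(\epsilon^2)^r$.

```python
def m_j(j, eps, gamma, p):
    """Explicit evaluation (9.9) of the keep-or-kill extremal problem on R^{2^j}."""
    return min(2 ** j * eps ** 2, gamma ** p * eps ** (2 - p), gamma ** 2)

def ideal_risk_bound(eps, C, delta, p, jmax=60):
    """Sum_j m_j(eps, gamma_j) with gamma_j = 2*C*2^{-j(delta+1/2-1/p)} (Lemma 9.5)."""
    total = 0.0
    for j in range(jmax):
        gamma_j = 2 * C * 2 ** (-j * (delta + 0.5 - 1.0 / p))
        total += m_j(j, eps, gamma_j, p)
    return total

# The sum should scale like (C^2)^{1-r} (eps^2)^r with r = 2*delta/(2*delta+1).
C, p, delta = 1.0, 2.0, 0.5           # e.g. delta1 = delta2 = 1
r = 2 * delta / (2 * delta + 1)       # here r = 1/2
ratio = ideal_risk_bound(1e-3, C, delta, p) / ideal_risk_bound(1e-3 / 4, C, delta, p)
assert 3.0 < ratio < 5.5              # predicted (eps1^2 / eps2^2)^r = 16^{1/2} = 4
```

The geometric decay of the $\gamma_j$ is what makes the sum converge and concentrate near the crossover level, which is the mechanism behind (9.10).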

9.5 Near-Minimaxity of Best-Basis De-Noising

We have so far shown that the ideal risk is within constant factors of the minimax risk. Invoking now the oracle inequality of Theorem 7.1, the worst-case risk of the BOB estimator $\hat f$ does not exceed the ideal risk, and hence the minimax risk, by more than a logarithmic factor. This completes the proof of Theorem 9.1.


10 Near-Minimaxity of CART

We are now in a position to complete the proof of Theorem 1.1. We do this by showing that ideal dyadic CART is essentially as good as ideal best-basis de-noising.

Lemma 10.1 Let $R(\text{Keep-Kill}; f; \epsilon)$ be the ideal risk for keep-or-kill estimation in BAB$^*(\delta_1,\delta_2)$, and let $R(\text{Hered}; f; \epsilon)$ be the ideal risk for hereditary estimation in BAB$^*(\delta_1,\delta_2)$. Then
$$\sup_{f \in F^{\delta_1,\delta_2}_p(C)} R(\text{Hered}; f; \epsilon) \le B(\delta_1,\delta_2,p) \cdot \sup_{f \in F^{\delta_1,\delta_2}_p(C)} R(\text{Keep-Kill}; f; \epsilon). \qquad (10.1)$$

Once this lemma is established, it follows from Sections 9.1 and 9.4 that the risk of ideal dyadic CART is within constant factors of the minimax risk. Then the oracle inequality for dyadic CART, Theorem 7.2, shows that the performance of empirical dyadic CART comes within logarithmic factors of the ideal risk for dyadic CART. Theorem 1.1 therefore follows as soon as Lemma 10.1 is established.

To prove the Lemma, note that the ideal keep-or-kill estimator for a function $f$ has nonzero coefficients at the sites
$$S(f) = \{(j,R) : |\beta_{j,R}(f)| \ge \epsilon\}.$$
This can be modified to a hereditary linear estimator by replacing $S$ by $S^*$, the hereditary cover of $S$.
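The hereditary cover is the smallest superset of $S$ closed under taking parents, so that the kept coefficients form a subtree. For concreteness, here is a small sketch on a dyadic tree with nodes labelled $(j,k)$, a simplified 1-D stand-in for the rectangle-indexed trees of this paper:

```python
def hereditary_cover(S, parent):
    """Hereditary cover of a node set S in a tree: the smallest superset of S
    closed under taking parents (so the kept nodes form a subtree)."""
    cover = set()
    for node in S:
        while node is not None and node not in cover:
            cover.add(node)
            node = parent(node)
    return cover

# Nodes of a dyadic tree labelled (j, k): the parent of (j, k) is (j-1, k // 2).
def parent(node):
    j, k = node
    return (j - 1, k // 2) if j > 0 else None

S = {(3, 5), (3, 0)}
cover = hereditary_cover(S, parent)
# (3,5) pulls in (2,2), (1,1), (0,0); (3,0) pulls in (2,0), (1,0), (0,0).
assert cover == {(3, 5), (2, 2), (1, 1), (0, 0), (3, 0), (2, 0), (1, 0)}
```

Each node pulls in at most $j$ ancestors; controlling how much this inflates $\#S$ is exactly the counting problem addressed by Lemma 10.2 below.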

The ideal risk of the hereditary linear estimator $\hat\theta[S^*]$ obeys
$$E\|\hat\theta[S^*] - \theta\|_2^2 = \sum_{(j,R) \notin S^*} \beta_{j,R}^2 + \epsilon^2(\#(S^*) + 1) \le \sum_{(j,R) \notin S} \beta_{j,R}^2 + \epsilon^2(\#(S^*) + 1) \quad (\text{as } S \subset S^*).$$
Suppose we could bound $\#S^* \le A \cdot (\#S)$ for some constant $A > 0$. Then we would have
$$E\|\hat\theta[S^*] - \theta\|_2^2 \le \sum_{(j,R) \notin S} \beta_{j,R}^2 + \epsilon^2(A \cdot \#(S) + 1) \le (A+1) \cdot \Big(\sum_{(j,R) \notin S} \beta_{j,R}^2 + \epsilon^2(\#(S) + 1)\Big) = (A+1) \cdot E\|\hat\theta[S] - \theta\|_2^2.$$
It would then follow that risk bounds derived for keep-or-kill estimators would give rise to proportional risk bounds for hereditary estimators.

While the relation $\#S^* \le A \cdot (\#S)$ does not hold for every $f$, a weaker inequality of the same form holds, where one compares the largest possible size of $\#S(f)$ for an $f \in F^{\delta_1,\delta_2}_p(C)$ with the largest possible size of $\#S^*$. The lemma just below establishes this inequality; retracing the logic of the last few displays shows that it immediately implies Lemma 10.1, with $B = A + 1$.

Lemma 10.2 Define
$$N(\delta_1,\delta_2,p,C) = \sup\{\#S(f) : f \in F^{\delta_1,\delta_2}_p(C)\}, \qquad (10.2)$$
the largest number of coefficients used by an ideal keep-or-kill estimator in treating functions from $F^{\delta_1,\delta_2}_p(C)$. Similarly, let
$$N^*(\delta_1,\delta_2,p,C) = \sup\{\#S^*(f) : f \in F^{\delta_1,\delta_2}_p(C)\} \qquad (10.3)$$
be the size of the largest corresponding hereditary cover. Then for a finite positive constant $A = A(\delta_1,\delta_2,p)$,
$$N^* \le A(\delta_1,\delta_2,p) \cdot N.$$

Proof. If $\theta = (\theta_i)_{i=1}^d$ is a vector of dimension $d$ satisfying $\|\theta\|_{\ell^p} \le \gamma$, then $\#\{i : |\theta_i| \ge \epsilon\} \cdot \epsilon^p \le \gamma^p$, so
$$\#\{i : |\theta_i| \ge \epsilon\} \le (\gamma/\epsilon)^p, \qquad (10.4)$$
and of course
$$\#\{i : |\theta_i| \ge \epsilon\} \le d. \qquad (10.5)$$

Consider now the application of this to the vector $\theta_j = (\beta_{j,R})_R$, which has $d = 2^j$, with $\gamma = \gamma_j(F^{\delta_1,\delta_2}_p(C))$. Then $\#\{i : |\theta_i| \ge \epsilon\} \le \min(2^j, (\gamma_j/\epsilon)^p)$. The first term $2^j$ is sharp for $0 \le j \le j_0$, where $j_0 = j_0(\epsilon, C, \delta_1, \delta_2)$ is the real root of $2^j = (C 2^{-j(\delta+1/2-1/p)}/\epsilon)^p$. By a calculation (raising both sides of the defining equation to eliminate the $1/p$),
$$2^{j_0} \cdot 2^{j_0 p(\delta + 1/2 - 1/p)} = (C/\epsilon)^p \iff 2^{j_0 p(\delta+1/2)} = (C/\epsilon)^p \iff j_0 = \frac{\log_2(C/\epsilon)}{\delta + 1/2}.$$

For notational convenience, stratify the set $S$ as

$$S = S_0 \cup S_1 \cup \cdots \cup S_j \cup \cdots,$$
where
$$S_j = \{(j', R) \in S : j' = j\},$$
and
$$\#S_j \le 2^j, \quad 0 \le j \le j_0. \qquad (10.6)$$
Also we have
$$\#S_j \le 2^{j_0} \cdot 2^{-\beta(j-j_0)}, \quad \beta = \beta(\delta_1,\delta_2,p) > 0, \quad j \ge j_0. \qquad (10.7)$$

Now consider the cover $S^{**}$ defined by
$$S^{**} = \{(j,R) : 0 \le j \le j_0\} \cup \{(j,R) : j > j_0 \text{ and } (j,R) \text{ has a descendant in } S\}.$$
By construction, $S^{**}$ contains the hereditary cover (it contains terms at $j < j_0$ which the hereditary cover might not), and so bounds on the size of $S^{**}$ apply to $S^*$ also. Now
$$\#S^{**} \le 2^{j_0+1} + \sum_{j > j_0} A(j,R) \cdot \#S_j, \qquad (10.8)$$
where $A(j,R)$ is the number of ancestors $(j', R')$ of a term $(j,R)$ at levels $j'$ with $j_0 < j' \le j \le J$. As $A(j,R) \le (j - j_0)$,
$$\#S^{**} \le 2^{j_0+1} + \sum_{j > j_0} (j - j_0)\, 2^{j_0}\, 2^{-\beta(j-j_0)} = 2^{j_0} \cdot \Big(2 + \sum_{j > j_0} (j - j_0)\, 2^{-\beta(j-j_0)}\Big).$$

We conclude that
$$N^*(\delta_1,\delta_2,p) \le 2^{j_0} \cdot B_1,$$
for some constant $B_1(\delta_1,\delta_2,p)$. On the other hand, by constructing a hypercube at level $\lfloor j_0 \rfloor$ using the approach of Lemmas 8.12 and 9.3, we obtain, for a constant $B_2(\delta_1,\delta_2,p)$,
$$N(\delta_1,\delta_2,p) \ge B_2 \cdot 2^{j_0}.$$
Hence we may take $A = B_1/B_2$. $\Box$

11 Discussion

We collect here some final remarks.

11.1 Clarifications

We would like to point out clearly that the way the term "CART" is generally construed, as "greedy growing" of an exhaustive partition followed by "optimal pruning" in the implicit basis, is not what we have studied in this paper. Also, the data structure we have assumed, regular equispaced data on a two-dimensional rectangular lattice, is unlike the irregularly scattered data often assumed in CART studies. It would be interesting to know what properties can be established for the typical "greedy growing" non-dyadic CART algorithm in the irregularly scattered data case.

To minimize misunderstanding, let us be clear about the intersection between CART and BOB. CART is a general methodology used for classification and discrimination or for regression. It can be used on regular or irregularly spaced data, and it can construct optimal or greedy partitions within the general framework. Best-Ortho-Basis is a general methodology for adapting orthogonal bases to specific problems in applied mathematics. It can be used in constructing adaptive time-frequency bases, and also (we have seen in this paper) in constructing adaptive bases for functions on cartesian product domains. We have shown that the methods have something in common but, strictly speaking, only intersect under a very specific choice of problem area and entropy. Further discussion about patent lawsuits is unwarranted and pointless.

11.2 Extensions

Somewhat more general results are implicit in the results established here.

First, one can consider classes $F^{\delta_1,\delta_2}_{p_1,p_2}(C_1, C_2)$ of functions obeying
$$\|D^i_h f\|_{p_i} \le C_i h^{\delta_i}, \quad i = 1, 2.$$
The classes we have considered in this paper are the special cases $C_1 = C_2 = C$ and $p_1 = p_2 = p$. Parallel results hold for these more general classes, by essentially the same arguments, with a bit more book-keeping. We avoided the study of these more general classes only to simplify exposition.

Second, the log-terms we have established in Theorems 1.1 and 9.1 can be replaced by somewhat smaller log-terms. More specifically, in cases where the minimax risk scales like $N^{-r}$, the method of proof given here actually shows that the worst-case risk of dyadic CART is within a factor $O(\log(n)^r)$ of minimax. As $0 < r < 1$, this is an improvement in the size of the log term.


11.3 Important Related Work

We also mention some related work that may be of interest to the reader.

Complexity Bounds and Oracle Inequalities. Of course there is a heavy reliance in this paper on [11, 12]. But let us also clearly point out that the general idea of oracle inequalities is clearly present in Foster and George (1995), who used a slightly different oracle, less suited for our purposes here. Our underlying proof technique, the Complexity Bound underlying the proofs of Theorems 7.1 and 7.2, is very closely related to the Minimum Complexity formalism of Barron and Cover (1991) and subsequent work by Birgé and Massart (1994).

Density Estimation. This paper grew out of a discussion with Joachim Engel, who wondered how to generalize the results of [14] to higher dimensions. Engel (personal communication) has reported progress on obtaining results on the behavior of a procedure like Dyadic CART in the setting of density estimation.

Mixed Smoothness Spaces. Neumann and von Sachs [16] have also recently studied anisotropic smoothness classes, and have shown that wavelet thresholding in a tensor wavelet basis is nearly minimax for higher-order mixed smoothness classes. This shows that non-adaptive basis methods could also be used for obtaining nearly minimax results; the full adaptivity of CART is not really necessary for minimaxity alone.

Time-Frequency Analysis. Important related ideas are contained in two recent manuscripts associated with Coifman's group at Yale. First, the manuscript of Thiele and Villemoes, which independently uses fast dyadic recursive partitioning of the kind discussed here, only in a setting where the two dimensions are time and frequency. Second, the manuscript of Bennett, which independently uses fast dyadic recursive partitioning of the kind discussed here, only in a setting where the basis functions are anisotropic Walsh functions rather than anisotropic Haar functions.

12 Appendix A: Fast Algorithms

12.1 Fast Algorithm for Dyadic CART

For a given rectangle $R$, we let $A(R)$ denote $\mathrm{Ave}\{y \mid R\}$ and $V(R)$ denote $\mathrm{Var}\{y \mid R\}$, the sum of squared deviations of $y$ about its mean within $R$. We note that if $R = R_1 \cup R_2$, then $A(R)$ and $V(R)$ are simple combinations of the $A(R_i)$ and $V(R_i)$ for $i = 1, 2$. We call such combinations the updating formulas. The objective functional can be written
$$CPRSS(P, \lambda) = \sum_{R \in P} \{V(R) + \lambda\}. \qquad (12.1)$$

Congruency Classes. Dyadic rectangles come in side lengths $2^{j_1} \times 2^{j_2}$ for $0 \le j_1, j_2 \le \log_2(n)$. We let $\mathcal{R}(j_1, j_2)$ denote the collection of all rectangles with side lengths $2^{j_1} \times 2^{j_2}$; there are $(n/2^{j_1}) \cdot (n/2^{j_2})$ such rectangles.

ALGORITHM: DYADIC CART

Description. Finds the recursive dyadic partition of the $n \times n$ grid which optimizes the complexity-penalized residual sum of squares. (Here $R_{s,0}$ and $R_{s,1}$ denote the two halves of $R$ under a midpoint split in direction $s$.)

This algorithm, when it terminates, has placed in the arrays CART and Decor sufficient information to reconstruct the best partition. Indeed, starting at $R_0 = [0,n)^2$ and noting that $Decor(R_0)$ contains the indicator of the splitting direction $s(R_0)$ in Slot 2, we can follow a path to $(R_{s(R_0),0}, R_{s(R_0),1})$, and from there to their children, etc. In doing so we are traversing a decorated binary tree with decorations ["split", 1] or ["split", 2] at each nonterminal node; at the terminal nodes are the decorations ["term", $A(R)$]. This describes a recursive dyadic partition $P$ and a piecewise constant reconstruction. The optimal CART value is
$$CART(R_0). \qquad (12.2)$$

Inputs.
    y_{i1,i2}, 0 ≤ i1, i2 < n. Data to be analyzed.

Results.
    (CART(R))_R   Value of the subproblem associated with R
    (Decor(R))_R  Decoration associated with R

Internal Data Structures.
    (A(R))_R  Average of values in R
    (V(R))_R  Variance of values in R

Initialization.
    For each 1-by-1 rectangle R_i
        Set A(R_i) = y_i and V(R_i) = 0

Main Loop.
    For h = 1, ..., 2 log2(n)
        For each pair (j1, j2) satisfying j1 + j2 = h, 0 ≤ j1, j2 ≤ h
            For all R in R(j1, j2)
                Compute A(R) from A(R_{1,b}), b = 0, 1
                Compute V(R) from A(R_{1,b}), V(R_{1,b}), b = 0, 1
                Compute

                    CART(R) = min( V(R) + λ, CART(R_{1,0}) + CART(R_{1,1}), CART(R_{2,0}) + CART(R_{2,1}) )

                If CART(R) ≥ V(R) + λ
                    Decor(R) = ["term", A(R)]
                Elseif CART(R) ≥ CART(R_{1,0}) + CART(R_{1,1})
                    Decor(R) = ["split", 1]
                Elseif CART(R) ≥ CART(R_{2,0}) + CART(R_{2,1})
                    Decor(R) = ["split", 2]
                End
            End
        End
    End
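The dynamic program above can be sketched in code. This is a minimal illustration, not the paper's implementation: the function name `dyadic_cart`, the rectangle encoding (x0, y0, w, v), and the convention of charging the penalty λ once per terminal piece are all assumptions; sums and sums of squares stand in for the A(R)/V(R) updating formulas.

```python
import math

def dyadic_cart(y, lam):
    """Bottom-up dynamic program over all dyadic sub-rectangles of an
    n-by-n array (n a power of 2).  For each rectangle R we keep the
    running sum S and sum of squares Q, form V(R) = Q - S^2/|R| (the
    residual sum of squares about R's average), and take the cheapest of:
    keep R whole (V(R) + lam), split direction 1, or split direction 2."""
    n = len(y)
    J = int(math.log2(n))
    S, Q, cart, decor = {}, {}, {}, {}
    for h in range(2 * J + 1):                      # h = j1 + j2
        for j1 in range(max(0, h - J), min(h, J) + 1):
            j2 = h - j1
            w, v = 2 ** j1, 2 ** j2                 # side lengths of R
            for x0 in range(0, n, w):
                for y0 in range(0, n, v):
                    R = (x0, y0, w, v)
                    kids1 = kids2 = None
                    if j1 >= 1:                     # split direction 1
                        kids1 = ((x0, y0, w // 2, v), (x0 + w // 2, y0, w // 2, v))
                    if j2 >= 1:                     # split direction 2
                        kids2 = ((x0, y0, w, v // 2), (x0, y0 + v // 2, w, v // 2))
                    if h == 0:
                        S[R], Q[R] = float(y[x0][y0]), float(y[x0][y0]) ** 2
                    else:                           # updating formulas
                        c0, c1 = kids1 if kids1 else kids2
                        S[R], Q[R] = S[c0] + S[c1], Q[c0] + Q[c1]
                    V = Q[R] - S[R] ** 2 / (w * v)
                    options = [(V + lam, "term")]
                    for d, kids in ((1, kids1), (2, kids2)):
                        if kids:
                            options.append((cart[kids[0]] + cart[kids[1]], d))
                    cart[R], decor[R] = min(options, key=lambda t: t[0])
    return cart[(0, 0, n, n)], decor
```

Each dyadic rectangle is visited once, so the work is linear in the total number of dyadic rectangles, as the level-by-level loop above suggests.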

12.2 Fast Transform into Anisotropic Haar Basis

Description. Finds the orthogonal decomposition of a vector y into a linear combination of basis elements from a specified anisotropic Haar basis.

Inputs.
    y_{i1,i2}, 0 ≤ i1, i2 < n. Data to be analyzed.
    R(h)  List of the R ∈ NT(P*) having area 2^h
    s(R)  Splitting direction of R ∈ NT(P*)

Results.
    (α_R)_R  Coefficients in the anisotropic Haar basis

Internal Data Structures.
    (S(R))_R  Sum of values in R

Initialization.
    For each 1-by-1 rectangle R_i = [i1, i1 + 1) × [i2, i2 + 1), set S(R_i) = y_i.

Main Loop.
    For h = 1, 2, ..., 2 log2(n)
        For each R ∈ R(h)
            S(R) = S(R_{s(R),0}) + S(R_{s(R),1})
            α_R = ( S(R_{s(R),1}) − S(R_{s(R),0}) ) · 2^{−h/2}
        End
    End
    α_{[0,n)^2} = S([0, n)²)/n.

The algorithm takes 3N additions and multiplications.
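The transform can be sketched recursively rather than level-by-level (hypothetical names; `s` is assumed to be a callable giving the splitting direction of each nonterminal rectangle):

```python
import math

def aniso_haar(y, s):
    """Decompose an n-by-n array in the anisotropic Haar basis specified by
    s, a callable mapping a nonterminal rectangle (x0, y0, w, h) of the
    complete recursive dyadic partition to its splitting direction
    (1 = first axis, 2 = second).  coeffs maps each nonterminal R to the
    Haar coefficient (S1 - S0)/sqrt(|R|); the coarsest (constant)
    coefficient S(R0)/n is stored under the key "dc"."""
    n = len(y)
    coeffs = {}

    def S(x0, y0, w, h):                 # returns the sum of y over R
        if w == 1 and h == 1:
            return float(y[x0][y0])
        if w > 1 and (h == 1 or s((x0, y0, w, h)) == 1):
            s0 = S(x0, y0, w // 2, h)
            s1 = S(x0 + w // 2, y0, w // 2, h)
        else:
            s0 = S(x0, y0, w, h // 2)
            s1 = S(x0, y0 + h // 2, w, h // 2)
        coeffs[(x0, y0, w, h)] = (s1 - s0) / math.sqrt(w * h)
        return s0 + s1

    coeffs["dc"] = S(0, 0, n, n) / n
    return coeffs
```

Each of the 2N − 1 rectangles of the complete partition is visited once at O(1) cost, matching the linear operation count quoted above; since the basis is orthonormal, the coefficient energy equals the energy of the data.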

12.3 Fast Algorithm for Best Anisotropic Haar Basis

Description. Finds the best orthogonal basis, for the vector space of n × n arrays, among all bases which arise from complete recursive dyadic partitions.

This algorithm, when it terminates, has placed, in the arrays BOB and Decor, sufficient information to reconstruct the best basis. Indeed, starting at R0 = [0, n)² and noting that Decor(R0) contains s(R0) in Slot 2, we can follow a path to (R_{s(R0),0}, R_{s(R0),1}), and from there to their children, etc. In doing so we are building a decorated binary tree with decorations ["split", 1] or ["split", 2] at each nonterminal node; this describes a complete recursive dyadic partition P* and hence a basis B. Optimality of this basis follows from familiar arguments in dynamic programming. The optimal entropy is

E(B) = e( S(R0)·|R0|^{−1/2} ) + BOB(R0).   (12.3)

Results.
    (BOB(R))_R   Entropy associated with R
    (Decor(R))_R  Decoration associated with R

Internal Data Structures.
    (S(R))_R     Sum associated with R
    (D(R, 1))_R  Horizontal difference associated with R
    (D(R, 2))_R  Vertical difference associated with R

Initialization.
    For each i ∈ [0, n)², set S(R_i) = f(i) and BOB(R_i) = 0.

Main Loop.
    For h = 1, 2, ..., 2 log2(n)
        For each pair (j1, j2) with j_i ≥ 0 and j1 + j2 = h
            For each R ∈ R(j1, j2)
                S(R) = S(R_{1,0}) + S(R_{1,1})
                D(R, 1) = ( S(R_{1,1}) − S(R_{1,0}) ) · 2^{−h/2}
                D(R, 2) = ( S(R_{2,1}) − S(R_{2,0}) ) · 2^{−h/2}
                E(R, 1) = BOB(R_{1,0}) + BOB(R_{1,1}) + e(D(R, 1))
                E(R, 2) = BOB(R_{2,0}) + BOB(R_{2,1}) + e(D(R, 2))
                If E(R, 1) < E(R, 2)
                    Decor(R) = ["split", 1]
                    BOB(R) = E(R, 1)
                Else
                    Decor(R) = ["split", 2]
                    BOB(R) = E(R, 2)
                End
            End
        End
    End

(Note: we have written the algorithm as if every R has both vertical and horizontal splits. Actually only those R with both j1, j2 ≥ 1 do so; an extra "if" branch needs to be inserted to cover the other cases.)
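The dynamic program above can be sketched in code; this is a hypothetical illustration (names assumed), with the "if" branch of the note included and the coarsest term e(S(R0)/n) of (12.3) left to the caller. The additive entropy `e` must be supplied.

```python
import math

def best_aniso_basis(y, e):
    """Best anisotropic Haar basis for an n-by-n array by dynamic
    programming over all complete recursive dyadic partitions.  e is the
    additive entropy applied coefficient by coefficient (a hypothetical
    choice, e.g. e = abs).  BOB maps each rectangle to its optimal
    entropy; decor records the chosen split direction (1 or 2)."""
    n = len(y)
    J = int(math.log2(n))
    S, BOB, decor = {}, {}, {}
    for h in range(2 * J + 1):
        for j1 in range(max(0, h - J), min(h, J) + 1):
            j2 = h - j1
            w, v = 2 ** j1, 2 ** j2
            for x0 in range(0, n, w):
                for y0 in range(0, n, v):
                    R = (x0, y0, w, v)
                    if h == 0:
                        S[R], BOB[R] = float(y[x0][y0]), 0.0
                        continue
                    cands = []
                    if j1 >= 1:        # split direction 1
                        c0, c1 = (x0, y0, w // 2, v), (x0 + w // 2, y0, w // 2, v)
                        d = (S[c1] - S[c0]) * 2.0 ** (-h / 2.0)
                        cands.append((BOB[c0] + BOB[c1] + e(d), 1, S[c0] + S[c1]))
                    if j2 >= 1:        # split direction 2
                        c0, c1 = (x0, y0, w, v // 2), (x0, y0 + v // 2, w, v // 2)
                        d = (S[c1] - S[c0]) * 2.0 ** (-h / 2.0)
                        cands.append((BOB[c0] + BOB[c1] + e(d), 2, S[c0] + S[c1]))
                    BOB[R], decor[R], S[R] = min(cands, key=lambda t: t[0])
    return BOB, decor
```

Note that with e(d) = d² every complete basis gives the same total (Parseval), so a non-quadratic entropy such as the absolute value is what makes the basis choice non-trivial in a toy run.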

12.4 Fast Algorithm for Best Hereditary Reconstruction

ALGORITHM: Best Heredity

Description. Finds the subtree of a given tree which optimizes the complexity-penalized residual sum of squares.

This algorithm, when it terminates, has placed, in the arrays Best and Decor, sufficient information to reconstruct the best subtree. Indeed, starting at R0 = [0, n)² and noting that Decor(R0) contains the indicator of the splitting direction s(R0) in Slot 2, we can follow a path to (R_{s(R0),0}, R_{s(R0),1}), and from there to their children, etc. In doing so we are traversing a decorated binary tree with decorations ["split", 1] or ["split", 2] at each nonterminal node; at the terminal nodes are the decorations ["term", Ave(R)]. This describes a recursive dyadic partition P and a piecewise constant reconstruction. The optimal value of the CPRSS is

Best(R0).   (12.4)

Inputs.
    y_{i1,i2}, 0 ≤ i1, i2 < n. Data to be analyzed.
    P*  Complete recursive dyadic partition.

Results.
    (Best(R))_{R∈P*}   Value of the subproblem associated with R
    (Decor(R))_{R∈P*}  Decoration associated with the optimal heredity

Internal Data Structures.
    (A(R))_{R∈P*}  Average of values in R
    (V(R))_{R∈P*}  Variance of values in R

Initialization.
    For each terminal rectangle R_i, set A(R_i) = y_i and V(R_i) = 0.

Main Loop.
    For j = 2 log2(n) − 1, ..., 0
        For each rectangle R ∈ R(j)
            Compute A(R) from the A(R_{s(R),b}), b = 0, 1.
            Compute V(R) from the A(R_{s(R),b}), V(R_{s(R),b}), b = 0, 1.
            Compute
                Best(R) = min( V(R), Best(R_{s(R),0}) + Best(R_{s(R),1}) + λ )
            If Best(R) ≥ V(R)
                Decor(R) = ["term", A(R)]
            Elseif Best(R) ≥ Best(R_{s(R),0}) + Best(R_{s(R),1}) + λ
                Decor(R) = ["split", s(R)]
            End
        End
    End
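The pruning recursion above can be sketched directly on a tree structure. The nested-tuple representation and the function name are hypothetical stand-ins for P* and the Best/Decor arrays:

```python
def best_heredity(tree, lam):
    """Optimal hereditary pruning of a fixed partition tree.  A node is
    ('leaf', [values]) or ('split', left, right).  Returns (cost, pruned),
    where cost is the complexity-penalized RSS and pruned replaces a
    subtree by ('term', average) whenever
    V(R) <= Best(left) + Best(right) + lam."""
    def values(node):
        if node[0] == 'leaf':
            return node[1]
        return values(node[1]) + values(node[2])

    def solve(node):
        vals = values(node)
        a = sum(vals) / len(vals)
        V = sum((x - a) ** 2 for x in vals)       # RSS about the average
        if node[0] == 'leaf':
            return V, ('term', a)                 # terminal cell: V = 0
        cl, pl = solve(node[1])
        cr, pr = solve(node[2])
        if V <= cl + cr + lam:
            return V, ('term', a)                 # keep R as one piece
        return cl + cr + lam, ('split', pl, pr)

    return solve(tree)
```

As λ grows the pruned tree collapses toward the root, exactly the complexity/fit trade-off the CPRSS encodes.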

13 Appendix B: Proofs

13.1 Proof of Theorems 7.1 and 7.2

We prove a more general fact, concerning estimation in overcomplete dictionaries. The proof we give is a light modification of a proof in [12].

13.1.1 Constrained Minimum Complexity Estimates

Suppose we have an N-by-1 vector y and a dictionary of N-by-1 vectors φ_γ. We wish to approximate y as a superposition of dictionary elements, y ≈ Σ_{i=1}^m β_i φ_{γ_i}.

We construct a matrix Φ which is N by p, where p is the total number of dictionary elements. Let each column of the Φ matrix represent one dictionary element. Note that in the case of most interest to us, p ≫ N, as Φ "contains more than just a single basis". For example, in the setting of this paper, D is the dictionary of all anisotropic Haar functions, which has approximately p = 4N elements.

For approximating the vector y, we consider a vector β̃ ∈ R^p; the vector f̃ = Φβ̃ denotes a corresponding linear combination of dictionary elements. This places the approximation f̃ in correspondence with the coefficient vector β̃. Owing to the possible overcompleteness of Φ, this correspondence is in general one-to-many.

Define now the empirical complexity functional

K(f̃, y) = ||f̃ − y||₂² + σ²·λ²·N(f̃),


where

N(f̃) = min_{f̃ = Φβ̃} #{ j : β̃_j ≠ 0 }

is the complexity of constructing f̃ from the dictionary Φ. Also, define the theoretical complexity functional

K(f̃, f) = ||f̃ − f||₂² + σ²·λ²·N(f̃).

Let C be a collection of "allowable" coefficient vectors β ∈ R^p. We will be interested in approximations to y obeying these constraints and having small complexity. In a general setting, one can think of many interesting constraints to impose on allowable coefficients; for example, that coefficients should be positive, that coefficients should generate a monotone function, or that nonzero coefficients are attached to pairwise orthogonal elements.

Define the C-constrained minimum empirical complexity estimate

f̂ = argmin_{ f̃ = Φβ̃ : β̃ ∈ C } K(f̃, y).

In a moment we will prove the

Complexity Bound. Suppose y = f + z, where z is i.i.d. N(0, 1). Fix C ⊂ R^p and fix γ > 8, and consider the C-constrained minimum-complexity model selection with λ = γ·(1 + √(2 log p)). Then

E K(f̂, f) ≤ A(γ)·( σ²λ² + min_{ f̃ = Φβ̃ : β̃ ∈ C } K(f̃, f) ).   (13.1)

This shows that the empirical minimum complexity estimate is not far off from minimizing the theoretical complexity.

13.1.2 Relation to CART and BOB

We now explain why the complexity bound implies Theorems 7.1 and 7.2. We begin with the observation that the empirical complexity K(f̃, y) is just what we earlier called a complexity-penalized sum of squares.

Assume now that the dictionary D is the collection of all anisotropic Haar functions. Two constraint sets are particularly interesting.

First, let C_BOB be the collection of all coefficient vectors β which arise from combinations of atoms that all belong together in some orthogonal basis built from the anisotropic Haar dictionary. Remember, the dictionary has p ≫ N atoms, so at most N elements of β can be nonzero at once under this constraint. Also, we have seen in Section 3 that each basis in the anisotropic Haar system corresponds to a certain decorated tree, so this constraint says that the collections of coefficients which are allowed to be nonzero simultaneously correspond to certain collections of indices. This constraint can be made quite explicit and algorithmic, although we do not go into details here.

If we optimize the empirical complexity K(f̃, y) over all f̃ arising from a β ∈ C_BOB, we get exactly the estimator (4.1)-(4.3). We encourage the reader to check this fact.

Second, there is the CART constraint. Let C_CART be the collection of all vectors β for which the nonzero coefficients in the corresponding β̃ only refer to atoms which can appear together in an orthogonal basis, and for which the nonzero coefficients only occur in an hereditary pattern in that basis. We remark that C_CART ⊂ C_BOB.

If we optimize the empirical complexity K(f̃, y) over all f̃ arising from a β ∈ C_CART, we get exactly the estimator (2.4)-(2.5). We again encourage the reader to check this fact.

We now make two simple observations about the minimum complexity formalism, valid for any C, which the reader should verify:

K1. The theoretical complexity of f̂ upper-bounds the predictive loss:

K(f̂, f) ≥ ||f̂ − f||₂².

K2. The minimum theoretical complexity is within a logarithmic factor of the ideal risk:

min_{f̃∈C} K(f̃, f) = min_{f̃∈C} ( ||f̃ − f||₂² + σ²·λ²·N(f̃) )
    ≤ λ²·min_{f̃∈C} ( ||f̃ − f||₂² + σ²·N(f̃) )
    = λ²·min_{β̃∈C} ( ||Φβ̃ − f||₂² + σ²·#{ j : β̃_j ≠ 0 } )
    = λ²·R(Ideal_C, f).

These observations, translated into the cases C_BOB and C_CART, give Theorems 7.1 and 7.2 respectively.

13.1.3 Proof of the Complexity Bound

In what follows we assume the noise level σ² = 1. We follow Donoho and Johnstone (1995) line-by-line, who analyzed the unconstrained case C = R^p. Exactly the same analysis applies in the constrained case.

We first let f₀ denote a model of minimum theoretical complexity:

K(f₀, f) = min_{f̃∈C} K(f̃, f).

As f̂ has minimum empirical complexity,

K(f̂, y) ≤ K(f₀, y).

As ||f̂ − y||₂² = ||f̂ − f − z||₂², we can relate empirical and theoretical complexities by

K(f̂, y) = K(f̂, f) + 2⟨z, f − f̂⟩ + ||z||₂²,

and so, combining the last two displays,

K(f̂, f) ≤ K(f₀, f) + 2⟨z, f̂ − f₀⟩.

Now define the random variable

W(k) = sup{ ⟨z, m₂ − m₁⟩ : ||m_j − f||₂² ≤ k, λ²·N(m_j) ≤ k }.

Then

K(f̂, f) ≤ K(f₀, f) + 2·W( K(f̂, f) ).


This display shows the key idea. It turns out that W(k) ≪ k for all large k, and so this display forces K(f̂, f) to be not much larger than K(f₀, f).

Denote the minimum theoretical complexity by K₀ = K(f₀, f). Define k_j = 2^j·(1 − 8/γ)^{−1}·max(K₀, λ²) for j ≥ 0. Define the event

B_j = { W(k) ≤ 4k/γ for all k ≥ k_j }.

On the event B_j, the inequality

k ≤ K₀ + 2·W(k)

has no solutions for k ≥ k_j. Hence, on the event B_j,

K(f̂, f) ≤ k_j.

It follows that

E K(f̂, f) ≤ Σ_{j=0}^∞ k_{j+1}·Prob{ K(f̂, f) ∈ [k_j, k_{j+1}) }
    ≤ Σ_{j=0}^∞ k_{j+1}·Prob{ K(f̂, f) ≥ k_j }
    ≤ Σ_{j=0}^∞ k_{j+1}·Prob{ B_j^c }.

By Lemma 13.1 we get

E K(f̂, f) ≤ k₀·Σ_{j≥0} 2^{j+1}/(2^j)!
    ≤ max(K₀, λ²)·(1 − 8/γ)^{−1}·6.

Hence the complexity bound (13.1) holds, with A(γ) = 6·(1 − 8/γ)^{−1}.

Lemma 13.1 (Donoho and Johnstone, 1994)

Prob{ B_j^c } ≤ 1/(2^j)!.

The proof depends on tail bounds for chi-squared variables, which ultimately depend on concentration-of-measure estimates (e.g. the Borell-Tsirelson inequality).

13.2 Proof of Lemma 8.4

The proof of each display is similar, so we just discuss the first. Fix a rectangle R = I₁ × I₂ with |I₁| = 2^{−j1}. Let R_{1,0} and R_{1,1} denote its left and right halves. Then

⟨f, h¹_R⟩ = 2^{j/2}·( ∫_{R_{1,1}} f dx dy − ∫_{R_{1,0}} f dx dy ).

Hence, for the very special increment h = 2^{−j1−1},

∫_{R_{1,0}} (D¹_h f)(x, y) dx dy = ∫_{R_{1,0}} ( f(x + h, y) − f(x, y) ) dx dy
    = ∫_{R_{1,1}} f dx dy − ∫_{R_{1,0}} f dx dy = 2^{−j/2}·⟨f, h¹_R⟩.


For any sum Σ_R over rectangles R with disjoint interiors,

Σ_R |⟨f, h¹_R⟩|^p = 2^{jp/2}·Σ_R | ∫_{R_{1,0}} D¹_h f |^p.

Now if 1/p + 1/p′ = 1,

| ∫_{R_{1,0}} D¹_h f | ≤ ||D¹_h f||_{L_p(R_{1,0})}·||1||_{L_{p′}(R_{1,0})},

so

Σ_R |⟨f, h¹_R⟩|^p ≤ 2^{j(p/2 + (1−p))}·Σ_R ||D¹_h f||^p_{L_p(R_{1,0})}.

Now if Σ_R is interpreted to mean the sum over a partition of [0, 1]² by congruent rectangles, then

Σ_R ||D¹_h f||^p_{L_p(R)} = ||D¹_h f||^p_{L_p(Q¹_h)},

and so from ||D¹_h f||_{L_p(R_{1,0})} ≤ ||D¹_h f||_{L_p(R)} we conclude that

( Σ_{R∈R(j)} |⟨f, h¹_R⟩|^p )^{1/p} ≤ 2^{j(1/p − 1/2)}·||D¹_h f||_{L_p(Q¹_h)}
    ≤ 2^{j(1/p − 1/2)}·h^{α1}·C = 2^{−(j1+1)·α1}·2^{j(1/p − 1/2)}·C.

13.3 Proof of Lemma 8.6

We assume for the proof below that α1 and α2 are mutually irrational. Very slight modifications allow us to handle the exceptional cases.

Think of the quarter-plane consisting of (x, y) with x, y ≥ 0 as a collection of square "unit cells", with vertices on the integer lattice. Think of the set where x·α1 = y·α2 as a ray S in this quarter-plane, originating at (0, 0).

Let p_j = (j1(j), j2(j)) denote the sequence of points at which the ray S, where j1·α1 = j2·α2, crosses the lines j1 + j2 = j. Let p*_j = (j1*(j), j2*(j)) denote the sequence of pairs obtained from the optimal sequential partitioning of Definition 8.5.

Our Claim, to be established below: p_j and p*_j always belong to the same unit cell.

It follows from this Claim that j1*(j) > j1(j) − 1 and j2*(j) > j2(j) − 1; as a result

max( 2^{−j1*(j)−1}, 2^{−j2*(j)−1} ) ≤ 2·max( 2^{−j1(j)−1}, 2^{−j2(j)−1} ),

and the lemma follows.

The Claim is proved by induction. Indeed, at j = 0, p_j = p*_j = 0, so the Claim is true at j = 0.

For the inductive step, suppose the Claim is true for steps 0 ≤ j ≤ J; we prove it for J + 1. Let C_j denote "the" unit cell containing p_j, where, if several cells qualify, we select a cell having p_j on the skew diagonal.

Under this convention, at each step j, p_j lies on the skew diagonal of this cell, which joins its upper left corner to its lower right corner. Supposing the Claim is true at step j, p*_j is either at the upper left corner or at the lower right corner of the cell. Note also that C_{j+1} is either above C_j or to the right of C_j.

With this set-up, the inductive step requires two things: (i) that if p*_J is at the lower right corner of C_J, and C_{J+1} is above C_J, then p*_{J+1} is above p*_J, i.e. b2(J + 1) = 1; (ii) that if p*_J is at the upper left corner of C_J, and if C_{J+1} is the cell to the right of C_J, then p*_{J+1} is to the right of p*_J, i.e. b1(J + 1) = 1.

Now note that the trajectory of p*_J is determined by greedy minimization of the function f(x, y) = max(2^{−α1·x}, 2^{−α2·y}) over paths through integer lattice points. Below the ray S, ∂f/∂x = 0. We conclude that unit moves in the x-direction are useless when one is below S. On the other hand, below S, ∂f/∂y < 0, so a unit move in the y-direction, if it is available, is useful. Above the ray S, the situation is reversed: ∂f/∂y = 0. We conclude that any move in the y-direction is useless when one is above S, but a unit move in the x-direction, if it is available, is useful.

Suppose one is in case (i) of the above paragraph. Then one knows that the upper right vertex of C_J is below or on the ray S. It follows that a full unit move in the y-direction is available and useful. The greedy algorithm will certainly take it, and case (i) is established.

Suppose one is in case (ii) of the above paragraph. Then one knows that the upper right vertex of C_J is above or on the ray S. It follows that a full unit move in the x-direction is both available and useful. The greedy algorithm will certainly take it, and case (ii) is established.
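The greedy trajectory analyzed above is easy to simulate. The name `greedy_split_sequence` and the tie-breaking toward the y-direction are assumptions of this sketch:

```python
def greedy_split_sequence(alpha1, alpha2, J):
    """Greedy path through the integer lattice: at each of J steps,
    increment j1 or j2 so as to minimize
    f(j1, j2) = max(2**(-alpha1*j1), 2**(-alpha2*j2)).
    Returns the list of (j1, j2) pairs after each step."""
    j1 = j2 = 0
    path = []
    for _ in range(J):
        f_x = max(2.0 ** (-alpha1 * (j1 + 1)), 2.0 ** (-alpha2 * j2))
        f_y = max(2.0 ** (-alpha1 * j1), 2.0 ** (-alpha2 * (j2 + 1)))
        if f_x < f_y:
            j1 += 1          # x-move strictly better: take it
        else:
            j2 += 1          # otherwise move in the y-direction
        path.append((j1, j2))
    return path
```

For α1 = 1, α2 = 2, the ray S crosses the line j1 + j2 = j at j1(j) = 2j/3, and the greedy pair stays within a unit cell of the crossing point, as the Claim asserts.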

13.4 Proof of Lemma 8.10

Define

N(δ) = sup{ #{ i : |θ_i(f, BAB(α1, α2))| > δ } : f ∈ F^{α1,α2}_p(C) }.

The property in question amounts to the assertion that

N(δ)^{σ+1/2}·δ ≤ K·C,  for all δ > 0.   (13.2)

By Corollary 8.7, there are constants γ_j = γ_j(C) so that for f ∈ F^{α1,α2}_p(C), the coefficients in BAB(α1, α2) obey

( Σ_{R(j)} |θ_R|^p )^{1/p} ≤ γ_j.

Now define

n(δ, d, γ) = sup{ #{ i : |θ_i| > δ } : θ ∈ R^d, ||θ||_{ℓ_p} ≤ γ }.

Then

N(δ) ≤ 1 + Σ_{j≥0} n(δ, 2^j, γ_j),

where the γ_j are as above. Easy calculations (see (10.4) and (10.5)) yield n(δ, d, γ) = min(d, (γ/δ)^p); from γ_j = C·2^{−j(σ+1/2−1/p)} we get (13.2).
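The counting function n(δ, d, γ) can be illustrated numerically (hypothetical helper names):

```python
def count_exceed(theta, delta):
    """Number of coordinates of theta exceeding delta in absolute value."""
    return sum(1 for t in theta if abs(t) > delta)

def n_exceed_bound(delta, d, gamma, p):
    """The bound n(delta, d, gamma) <= min(d, (gamma/delta)**p): at most
    (gamma/delta)**p coordinates of a vector in the l_p ball of radius
    gamma can exceed delta, since k coordinates larger than delta
    contribute more than k*delta**p to the p-th power of the norm."""
    return min(float(d), (gamma / delta) ** p)
```

Summing this bound over the levels j, with d = 2^j and γ = γ_j, is exactly the computation behind (13.2).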

13.5 Proof of Lemma 8.11

The proof is an application of the following fact, called the "incompressibility of hypercubes" in [8]. Suppose that H is an orthogonal hypercube symmetric about zero; then it can be written H = { Σ_j θ_j g_j : |θ_j| ≤ δ }, where the g_j are orthogonal and the θ_j vary throughout the cube. We call any basis starting with the elements g1, g2, ..., g_m a natural basis for H. In that basis, H is rotated so that the axes cut orthogonally through its faces.


Let B be a natural basis for such an H and let Θ = Θ(H, B) be the body of coefficients of H in that basis. Let U be any orthogonal matrix. Then, for absolute constants c(p), 0 < p ≤ 2,

sup_{θ∈Θ} ||Uθ||_{ℓ_p} ≥ c(p)·sup_{θ∈Θ} ||θ||_{ℓ_p}.

In [8], this is shown to be a consequence of Khintchine's inequality.

To use this, we argue by contradiction. Suppose that the hypotheses of the Lemma hold, and yet for a certain basis B*, τ*(Θ(F, B*)) = τ − Δ where Δ > 0. Then for 0 < ε < Δ we have the weak-type inclusion Θ(F, B*) ⊂ wℓ_{τ−Δ}. Equally, we have the stronger inclusion Θ(F, B*) ⊂ ℓ_{τ−ε}.

Let H_j be the j-th hypercube in the sequence posited by the Theorem, and let B_j be a natural basis for H_j. There is an orthogonal matrix U_j so that θ(f, B*) = U_j·θ(f, B_j). Then

sup_{h∈H_j} ||θ(h, B*)||_{ℓ_{τ−ε}} = sup_{h∈H_j} ||U_j·θ(h, B_j)||_{ℓ_{τ−ε}}
    ≥ c(τ−ε)·sup_{h∈H_j} ||θ(h, B_j)||_{ℓ_{τ−ε}}.

On the other hand,

sup_{h∈H_j} ||θ(h, B_j)||_{ℓ_{τ−ε}} = m_j^{1/(τ−ε)}·δ_j
    = ( m_j^{1/τ}·δ_j )·m_j^{1/(τ−ε) − 1/τ}
    ≥ c₀·m_j^{1/(τ−ε) − 1/τ} → ∞.

Hence Θ(F, B*) ⊄ ℓ_{τ−ε} for any ε > 0. This contradiction proves the Lemma.

13.6 Proof of Lemma 8.12

The Construction. Let g be a smooth function on R², supported inside the unit square [0, 1]², whose support contains the half-square [1/2, 3/4]². Suppose that ||∂g/∂x||_{L∞} ≤ Λ and that ||∂g/∂y||_{L∞} ≤ Λ. Suppose also that ||g||_{L₂} = 1.

Let R(j) be the tiling of [0, 1]² selected at level j by BAB(α1, α2). As this is a spatially homogeneous basis, all tiles are congruent. For an R ∈ R(j), let g_R denote the translation and dilation of g so that it "just fits" inside R: i.e. supp(g_R) ⊂ R and R/2 ⊂ supp(g_R), where R/2 denotes the rectangle with the same center homothetically shrunk by a factor of 50%.

Let δ_j = (1/(6Λ))·C·2^{−j(σ+1/2)}, and set

H_j = { Σ_{R∈R(j)} θ_R·g_R : |θ_R| ≤ δ_j }.

Property 1. We first note that H_j obeys the dimension inequality assumed in the statement of the lemma, with K = 1/(6Λ). Set σ = α1·α2/(α1 + α2) and τ = 1/(σ + 1/2). With m_j = 2^j the dimension of H_j and δ_j the sidelength, one gets

m_j^{1/τ}·δ_j = C₀ > 0

with C₀ = C/(6Λ).


Property 2. The key claim about H_j is the embedding H_j ⊂ F^{α1,α2}_p(C): for any f ∈ H_j,

sup_{0<h<1} h^{−α1}·||D¹_h f||_{L_p(Q¹_h)} ≤ C,   (13.3)
sup_{0<h<1} h^{−α2}·||D²_h f||_{L_p(Q²_h)} ≤ C.

We prove the first inequality only, starting with estimates for differences of g_R. Let R be of side 2^{−j1} by 2^{−j2}.

Let h > 2^{−j1}, and let R_h denote the translate of R to the left by h. Then if (x, y) ∈ R_h, D¹_h g_R(x, y) = g_R(x + h, y), while if (x, y) ∈ R, D¹_h g_R(x, y) = −g_R(x, y). Note further that R_h is not generally part of the tiling R(j), but instead overlaps two tiles, R⁻_h and R⁺_h, say. Let b_R(x, y) = g_R(x + h, y)·1_{R⁻_h}(x, y), and c_R(x, y) = g_R(x + h, y)·1_{R⁺_h}(x, y). Then D¹_h g_R = a_R + b_R + c_R, where a_R is supported in R, and b_R and c_R are supported in R⁻_h and R⁺_h respectively. We have, for each R,

||a_R||_∞, ||b_R||_∞, ||c_R||_∞ ≤ Λ·2^{j/2}.

Now consider the case 0 < h ≤ 2^{−j1}. Let b_R(x, y) = g_R(x + h, y)·1_{R_h}(x, y) and c_R(x, y) = 0, and set R⁺_h = R⁻_h = R_h. Then D¹_h g_R = a_R + b_R + c_R, where a_R is supported in R, and b_R is supported in R⁻_h. We have

||a_R||_∞, ||b_R||_∞ ≤ min(h·2^{j1}, 1)·Λ·2^{j/2}.

Now consider increments of f = Σ_R θ_R·g_R. Rearrange the terms to have common support:

D¹_h f = Σ_R θ_R·a_R + θ_{R⁻_h}·b_R + θ_{R⁺_h}·c_R.

Now

||(θ_{R⁻_h})||_{ℓ_p}, ||(θ_{R⁺_h})||_{ℓ_p} ≤ ||(θ_R)||_{ℓ_p},

and so

||D¹_h f||_p ≤ || Σ_R θ_R·a_R ||_p + || Σ_R θ_{R⁻_h}·b_R ||_p + || Σ_R θ_{R⁺_h}·c_R ||_p
    ≤ ||(θ_R)||_{ℓ_p}·max_R ||a_R||_p + ||(θ_{R⁻_h})||_{ℓ_p}·max_R ||b_R||_p + ||(θ_{R⁺_h})||_{ℓ_p}·max_R ||c_R||_p
    ≤ 3Λ·min(h·2^{j1}, 1)·2^{j(1/2−1/p)}·||(θ_R)||_{ℓ_p},

where in the last step we used ||a_R||_p ≤ ||a_R||_∞·|R|^{1/p} and |R| = 2^{−j}, and similarly for b_R and c_R.

Hence

sup_{0<h<1} h^{−α1}·||D¹_h f||_p ≤ sup_h h^{−α1}·3Λ·||(θ_R)||_{ℓ_p}·min(h·2^{j1}, 1)·2^{j(1/2−1/p)}
    = 3Λ·||(θ_R)||_{ℓ_p}·2^{j(1/2−1/p)}·sup_h h^{−α1}·min(h·2^{j1}, 1)
    = 3Λ·||(θ_R)||_{ℓ_p}·2^{j(1/2−1/p)}·2^{j1·α1}.

Now from the proof of Lemma 8.6 we know that for BAB(α1, α2),

2^{j1·α1} ≤ 2·2^{j·σ};

we conclude that

sup_{0<h<1} h^{−α1}·||D¹_h f||_p ≤ 6Λ·2^{j(σ+1/2−1/p)}·||θ||_{ℓ_p} ≤ C.

This establishes (13.3).


13.7 Proof of Lemma 9.3

Recall the proof of Lemma 8.12. Let δ > 0 be given, and pick j so that the values δ_j, δ_{j+1} defined in that lemma satisfy

δ_{j+1} < δ ≤ δ_j.

Construct the hypercube H_j exactly as in that lemma, only using sidelength δ in place of δ_j.

We first note that the generating elements g_R are orthogonal with respect to the sampling measure ℓ²_N, because they are disjointly supported. We also note that, because of the dyadic structure of the sampling and the congruency of the tiles, each g_R has the same ℓ²_N norm as every other. Call this norm ε(δ). Finally, we note that

ε = δ·( (1/(M1·M2))·Σ g(x_i)² )^{1/2} / ||g||_{L₂[0,1]²},

where the sum is over an M1-by-M2 array of grid points, with M_i = 2^{J − j_i(j)}. Hence H_j is an orthogonal hypercube for ℓ²_N. The asymptotics of the sidelength can be derived from the fact that g is nice and the grid becomes finer as j increases, so that the indicated sum converges to the corresponding integral, whence

ε = δ·(1 + o(1)).

The hypercube H_j that results has two properties. First,

m_j^{1/τ}·δ = C₁(δ),

where

C₁(δ) = C₀·(δ/δ_j) > C₀·(δ_{j+1}/δ_j) = (C/(6Λ))·2^{−(σ+1/2)}.

Hence the dimensionality of the hypercube obeys (9.4), with K = (6·Λ·2^{σ+1/2})^{−τ}. Second,

H_j ⊂ F^{α1,α2}_p(C).

This inclusion follows exactly as in Lemma 8.12.

References

[1] A. Barron and T. Cover (1991). Minimum complexity density estimation. IEEE Trans. Info. Theory 37, 1034-1054.

[2] N. Bennett (1995). Computing Best Dyadic Local Cosine and Wavelet Packet Bases. Manuscript.

[3] L. Birgé and P. Massart (1994). From model selection to adaptive estimation. To appear, Le Cam Festschrift, D. Pollard and G. Yang, eds.

[4] L. Breiman, J. Friedman, R. Olshen, and C.J. Stone (1983). Classification and Regression Trees. Belmont, CA: Wadsworth.

[5] R.R. Coifman, Y. Meyer, S. Quake, and M.V. Wickerhauser (1994). Wavelet analysis and signal processing. Pp. 363-380 in Wavelets and Their Applications, J.S. Byrnes, J.L. Byrnes, K.A. Hargreaves, and K. Berry, eds. Kluwer Academic: Boston.

[6] R.R. Coifman and M.V. Wickerhauser (1992). Entropy-based algorithms for best-basis selection. IEEE Trans. Info. Theory 38, 713-718.

[7] D.L. Donoho (1995). De-noising by soft thresholding. IEEE Trans. Info. Theory 41, 613-627.

[8] D.L. Donoho (1993). Unconditional bases are optimal bases for data compression and for statistical estimation. Applied and Computational Harmonic Analysis 1, 100-115.

[9] D.L. Donoho (1995). Abstract statistical estimation and modern harmonic analysis. Proc. 1994 Int. Cong. Math., to appear.

[10] D.L. Donoho and I.M. Johnstone (1994). Ideal spatial adaptation via wavelet shrinkage. Biometrika 81, 425-455.

[11] D.L. Donoho and I.M. Johnstone (1994). Ideal de-noising in a basis chosen from a library of orthonormal bases. Comptes Rendus Acad. Sci. Paris A 319, 1317-1322.

[12] D.L. Donoho and I.M. Johnstone. Empirical Atomic Decomposition. Manuscript.

[13] D.L. Donoho, I.M. Johnstone, G. Kerkyacharian, and D. Picard (1995). Wavelet shrinkage: Asymptopia? J. Roy. Statist. Soc. Ser. B 57, no. 2, 301-369.

[14] J. Engel (1994). A simple wavelet approach to nonparametric regression from recursive partitioning schemes. J. Multivariate Analysis 49, 242-254.

[15] D. Foster and E.I. George (1995). The risk inflation factor in multiple linear regression. Ann. Statist.

[16] M. Neumann and R. von Sachs (1994). Anisotropic wavelet smoothing, with an application to nonstationary spectral analysis. Manuscript.

[17] V. Temlyakov (1986). Approximation of functions with bounded mixed derivatives. Trudy Mat. Inst. Akad. Nauk SSSR 178, 1-112.

[18] C.M. Thiele and L.F. Villemoes (1994). A Fast Algorithm for Adapted Time-Frequency Tilings. Manuscript.
