
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 44, NO. 6, OCTOBER 1998 2435

Data Compression and Harmonic Analysis
David L. Donoho, Martin Vetterli, Fellow, IEEE, R. A. DeVore, and Ingrid Daubechies, Senior Member, IEEE

(Invited Paper)

Abstract—In this paper we review some recent interactions between harmonic analysis and data compression. The story goes back of course to Shannon's $R(D)$ theory in the case of Gaussian stationary processes, which says that transforming into a Fourier basis followed by block coding gives an optimal lossy compression technique; practical developments like transform-based image compression have been inspired by this result. In this paper we also discuss connections perhaps less familiar to the Information Theory community, growing out of the field of harmonic analysis. Recent harmonic analysis constructions, such as wavelet transforms and Gabor transforms, are essentially optimal transforms for transform coding in certain settings. Some of these transforms are under consideration for future compression standards.

We discuss some of the lessons of harmonic analysis in this century. Typically, the problems and achievements of this field have involved goals that were not obviously related to practical data compression, and have used a language not immediately accessible to outsiders. Nevertheless, through an extensive generalization of what Shannon called the "sampling theorem," harmonic analysis has succeeded in developing new forms of functional representation which turn out to have significant data compression interpretations. We explain why harmonic analysis has interacted with data compression, and we describe some interesting recent ideas in the field that may affect data compression in the future.

Index Terms—Besov spaces, block coding, cosine packets, ε-entropy, Fourier transform, Gabor transform, Gaussian process, Karhunen–Loève transform, Littlewood–Paley theory, non-Gaussian process, n-widths, rate-distortion, sampling theorem, scalar quantization, second-order statistics, Sobolev spaces, subband coding, transform coding, wavelet packets, wavelet transform, Wilson bases.

“Like the vague sighings of a wind at even
That wakes the wavelets of the slumbering sea.”

Shelley, 1813

Manuscript received June 1, 1998; revised July 6, 1998. The work of D. L. Donoho was supported in part by NSF under Grant DMS-95-05151, by AFOSR under Grant MURI-95-F49620-96-1-0028, and by other sponsors. The work of M. Vetterli was supported in part by NSF under Grant MIP-93-213002 and by the Swiss NSF under Grant 20-52347.97. The work of R. A. DeVore was supported by ONR under Contract N0014-91-J1343 and by the Army Research Office under Contract N00014-97-0806. The work of I. Daubechies was supported in part by NSF under Grant DMS-9706753, by AFOSR under Grant F49620-98-1-0044, and by ONR under Contract N00014-96-1-0367.

D. L. Donoho is with Stanford University, Stanford, CA 94305 USA.

M. Vetterli is with the Communication Systems Division, the Swiss Federal Institute of Technology, CH-1015 Lausanne, Switzerland, and with the Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720 USA.

R. A. DeVore is with the Department of Mathematics, University of South Carolina, Columbia, SC 29208 USA.

I. Daubechies is with the Department of Mathematics, Princeton University, Princeton, NJ 08544 USA.

Publisher Item Identifier S 0018-9448(98)07005-9.

“ASBOKQTJEL”
Postcard from J. E. Littlewood to A. S. Besicovich, announcing A. S. B.'s election to fellowship at Trinity

“…the 20 bits per second which, the psychologists assure us, the human eye is capable of taking in…”

D. Gabor, Guest Editorial, IRE Trans. Inform. Theory, Sept. 1959.

I. INTRODUCTION

DATA compression is an ancient activity; abbreviation and other devices for shortening the length of transmitted messages have no doubt been used in every human society. Language itself is organized to minimize message length, short words being more frequently used than long ones, according to Zipf's empirical distribution.

Before Shannon, however, the activity of data compression was informal and ad hoc; Shannon created a formal intellectual discipline for both lossless and lossy compression, whose 50th anniversary we now celebrate.

A remarkable outcome of Shannon's formalization of problems of data compression has been the intrusion of sophisticated theoretical ideas into widespread use. The JPEG standard, set in the 1980's and now in use for transmitting and storing images worldwide, makes use of quantization, run-length coding, entropy coding, and fast cosine transformation. In the meantime, software and hardware capabilities have developed rapidly, so that standards currently in process of development are even more ambitious. The proposed standard for still image compression—JPEG-2000—contains the possibility for conforming codecs to use trellis-coded quantizers, arithmetic coders, and fast wavelet transforms.

For the authors of this paper, one of the very striking features of recent developments in data compression has been the applicability of ideas taken from the field of harmonic analysis to both the theory and practice of data compression. Examples of this applicability include the appearance of the fast cosine transform in the JPEG standard and the consideration of the fast wavelet transform for the JPEG-2000 standard. These fast transforms were originally developed in applied mathematics for reasons completely unrelated to the demands of data compression; only later were applications in compression proposed.

John Tukey became interested in the possibility of accelerating Fourier transforms in the early 1960's in order to enable spectral analysis of long time series; in spite of the fact that Tukey coined the word "bit," there was no idea in his mind at the time of applications to data compression. Similarly, the construction of smooth wavelets of compact support was prompted by questions posed implicitly or explicitly by the multiresolution analysis concept of Mallat and Meyer, and not, at that time, by direct applications to compression.

In asking about this phenomenon—the applicability of computational harmonic analysis to data compression—there are, broadly speaking, two extreme positions to take.

The first, maximalist position holds that there is a deep reason for the interaction of these disciplines, which can be explained by appeal to information theory itself. This point of view holds that sinusoids and wavelets will necessarily be of interest in data compression because they have a special "optimal" role in the representation of certain stochastic processes.

The second, minimalist position holds that, in fact, computational harmonic analysis has exerted an influence on data compression merely by happenstance. This point of view holds that there is no fundamental connection between, say, wavelets and sinusoids, and the structure of digitally acquired data to be compressed. Instead, such schemes of representation are privileged to have fast transforms, and to have been well known, well studied, and widely implemented at the moment that standards were being framed.

When one considers possible directions that data compression might take over the next fifty years, the two points of view lead to very different predictions. The maximalist position would predict that there will be continuing interactions between ideas in harmonic analysis and data compression; that as new representations are developed in computational harmonic analysis, these will typically have applications to data compression practice. The minimalist position would predict that there will probably be little interaction between the two areas in the future, or that what interaction does take place will be sporadic and opportunistic.

In this paper, we would like to give the reader the background to appreciate the issues involved in evaluating the two positions, and to enable the reader to form his/her own evaluation. We will review some of the connections that have existed classically between methods of harmonic analysis and data compression, we will describe the disciplines of theoretical and computational harmonic analysis, and we will describe some of the questions that drive those fields.

We think there is a "Grand Challenge" facing the disciplines of both theoretical and practical data compression in the future: the challenge of dealing with the particularity of naturally occurring phenomena. This challenge has three facets:

GC1 Obtaining accurate models of naturally occurring sources of data.

GC2 Obtaining “optimal representations” of such models.

GC3 Rapidly computing such “optimal representations.”

We argue below that current compression methods might be far away from the ultimate limits imposed by the underlying structure of specific data sources, such as images or acoustic phenomena, and that efforts to do better than what is done today—particularly in specific applications areas—are likely to pay off.

Moreover, parsimonious representation of data is a fundamental problem with implications reaching well beyond compression. Understanding the compression problem for a given data type means an intimate knowledge of the modeling and approximation of that data type. This in turn can be useful for many other important tasks, including classification, denoising, interpolation, and segmentation.

The discipline of harmonic analysis can provide interesting insights in connection with the Grand Challenge.

The history of theoretical harmonic analysis in this century repeatedly provides evidence that in attacking certain challenging and important problems involving characterization of infinite-dimensional classes of functions, one can make progress by developing new functional representations based on certain geometric analogies, and by validating that those analogies hold in a quantitative sense, through a norm equivalence result. Also, the history of computational harmonic analysis has repeatedly been that geometrically motivated analogies constructed in theoretical harmonic analysis have often led to fast concrete computational algorithms.

The successes of theoretical harmonic analysis are interesting from a data compression perspective. What the harmonic analysts have been doing—showing that certain orthobases afford certain norm equivalences—is analogous to the classical activity of showing that a certain orthobasis diagonalizes a quadratic form. Of course, the diagonalization of quadratic forms is of lasting significance for data compression in connection with transform coding of Gaussian processes. So one could expect the new concept to be interesting a priori. In fact, the new concept of "diagonalization" obtained by harmonic analysts really does correspond to transform coders—for example, wavelet coders and Gabor coders.

The question of whether the next 50 years will display interactions between data compression and harmonic analysis more like a maximalist or a minimalist profile is, of course, anyone's guess. This paper provides encouragement to those taking the maximalist position.

The paper is organized as follows. At first, classical results from the rate-distortion theory of Gaussian processes are reviewed and interpreted (Sections II and III). In Section IV, we develop the functional point of view, which is the setting for harmonic analysis results relevant to compression, but which is somewhat at variance with the digital signal processing viewpoint. In Section V, the important concept of Kolmogorov ε-entropy of function classes is reviewed, as an alternate approach to a theory of compression. In Section VI, practical transform coding as used in image compression standards is described. We are now in a position to show commonalities between the approaches seen so far (Section VII), and then to discuss limitations of classical models (Section VIII) and propose some variants by way of simple examples. This leads us to pose the "Grand Challenges" to data compression as seen from our perspective (Section IX), and to overview how Harmonic Analysis might participate in their solutions. This leads to a survey of Harmonic Analysis results, in particular on norm equivalences (Sections XI–XIII) and nonlinear approximation (Section XIV). In effect, one can show that harmonic analysis, which is effective at establishing norm equivalences, leads to coders which achieve the ε-entropy of functional classes (Section XV). This has a transform coding interpretation (Section XVI), showing a broad analogy between the deterministic concept of unconditional basis and the stochastic concept of Karhunen–Loève expansion. In Section XVII, we discuss the role of tree-based ideas in harmonic analysis, and their relevance for data compression. Section XVIII briefly surveys some harmonic analysis results on time–frequency-based methods of data compression. The fact that many recent results from theoretical harmonic analysis have computationally effective counterparts is described in Section XIX. Practical coding schemes using, or having led to, some of the ideas described thus far are described in Section XX, including current advanced image compression algorithms based on fast wavelet transforms. As a conclusion, a few key contributors in harmonic analysis are used to iconify certain key themes of this paper.

II. R(D) FOR GAUSSIAN PROCESSES

Fifty years ago, Claude Shannon launched the subject of lossy data compression of continuous-valued stochastic processes [83]. He proposed a general idea, the rate-distortion function, which in concrete cases (Gaussian processes) leads to a beautifully intuitive method to determine the number of bits required to approximately represent sample paths of a stochastic process.

Here is a simple outgrowth of the Shannon theory, important for comparison with what follows. Suppose $X(t)$ is a Gaussian zero-mean stochastic process on an interval $T$, and let $N(D)$ denote the minimal number of codewords needed in a codebook $\mathcal{C}$ so that

$$E \min_{c \in \mathcal{C}} \|X - c\|_{L^2(T)}^2 \le D. \tag{2.1}$$

Then Shannon proposed that in an asymptotic sense

$$\log_2 N(D) \approx R(D) \tag{2.2}$$

where $R(D)$ is the rate-distortion function for $X$

$$R(D) = \min\{\, I(X, Y) : E\|X - Y\|^2 \le D \,\} \tag{2.3}$$

with $I(X, Y)$ the usual mutual information, given formally by

$$I(X, Y) = E \log_2 \frac{p(X, Y)}{p(X)\, p(Y)}. \tag{2.4}$$

Here $R(D)$ can be obtained in parametric form from a formula which involves functionals of the covariance kernel $K(s, t) = \mathrm{Cov}(X(s), X(t))$; more specifically of the eigenvalues $(\lambda_k)$ of $K$. In the form first published by Kolmogorov (1956), but probably known earlier to Shannon, for the water level parameter $\theta > 0$ we have

$$D_\theta = \sum_k \min(\theta, \lambda_k) \tag{2.5}$$

where

$$R(D_\theta) = \sum_k \max\!\left(0,\ \tfrac{1}{2} \log_2 \frac{\lambda_k}{\theta}\right). \tag{2.6}$$

The random process achieving the minimum of the mutual information problem can be described as follows. It has a covariance with the same eigenfunctions as that of $X$, but the eigenvalues are reduced in size: $\lambda_k' = \max(\lambda_k - \theta, 0)$.

To obtain a codebook achieving the value predicted by $R(D)$, Shannon's suggestion would be to sample realizations from the reproducing process realizing the minimum of the least mutual information problem.

Formally, then, the structure of the optimal data compression problem is understood by passing to the Karhunen–Loève expansion

$$X(t) = \sum_k X_k \phi_k(t).$$

In this expansion, the coefficients $(X_k)$ are independent zero-mean Gaussian random variables. We have a similar expansion for the reproducing distribution

$$Y(t) = \sum_k Y_k \phi_k(t).$$

The process $Y$ has only finitely many nonzero coefficients, namely, those coefficients at indices where $\lambda_k > \theta$; let $k(\theta)$ denote the indicated subset of coefficients. Random codebook compression is effected by comparing the vector $(X_k : k \in k(\theta))$ of coefficients with a sequence of codewords for $(Y_k : k \in k(\theta))$, looking for a closest match in Euclidean distance. The approach just outlined is often called "reverse waterfilling," since only coefficients above a certain water level $\theta$ are described in the coding process.

As an example, let $T = [0, 1]$ and let $X(t)$ be the Brownian Bridge, i.e., the continuous Gaussian process with covariance

$$K(s, t) = \min(s, t) - st.$$

This Gaussian process has $X(0) = X(1) = 0$ and can be obtained by taking a Brownian motion $W(t)$ and "pinning" it down at $t = 1$. The covariance kernel has sinusoids for eigenvectors: $\phi_k(t) = \sqrt{2} \sin(\pi k t)$, and has eigenvalues $\lambda_k = (\pi k)^{-2}$. The subset $k(\theta)$ amounts to a frequency band of the first $\asymp \theta^{-1/2}$ frequencies as $\theta \to 0$. (Here and below, we use $\asymp$ to mean that the two expressions are equivalent to within multiplicative constants, at least as the underlying argument tends to its limit.) Hence the compression scheme amounts to "going into the frequency domain," "extracting the low frequency coefficients," and "comparing with codebook entries." The number of low frequency coefficients to keep is directly related to the desired distortion level. The achieved $R(D)$ in this case scales as $D^{-1}$.

Another important family of examples is given by stationary processes. In this case, the eigenfunctions of the covariance are essentially sinusoids, and the Karhunen–Loève expansion has a more concrete interpretation. To make things simple and analytically exact, suppose we are dealing with the circle $T = [0, 2\pi)$, and considering stationarity with respect to circular shifts. The stationarity condition is $K(s, t) = \sigma(s \ominus t)$, where $\ominus$ denotes circular (clock) arithmetic. The eigenfunctions of $K$ are the sinusoids

$$\phi_k(t) = e^{ikt}, \qquad k = 0, \pm 1, \pm 2, \ldots$$

and the eigenvalues are the Fourier coefficients $(\sigma_k)$ of $\sigma$. We can now identify the index $k$ with frequency, and the Karhunen–Loève representation effectively says that the Fourier coefficients of $X$ are independently zero-mean Gaussian, with variances $\sigma_k$, and that the reproducing process $Y$ has Fourier coefficients which are independent Gaussian coefficients with variances $\max(\sigma_k - \theta, 0)$. For instance, consider the case $\sigma_k \asymp |k|^{-2m-1}$ as $|k| \to \infty$; then the stationary process has nearly $m$-derivatives in a mean-square sense. For this example, the band $k(\theta)$ amounts to the first $\asymp \theta^{-1/(2m+1)}$ frequencies. Hence once again the compression scheme amounts to "going into the frequency domain," "extracting the low frequency coefficients," and "comparing with codebook entries," and the number of low frequency coefficients retained is given by the desired distortion level. The achieved $R(D)$ in this case scales as $D^{-1/2m}$.
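To make reverse waterfilling concrete, here is a minimal numerical sketch (our own Python/NumPy illustration, not part of the original paper; the function name is ours) that evaluates the parametric formulas (2.5) and (2.6) for the Brownian Bridge eigenvalues $\lambda_k = (\pi k)^{-2}$ and checks the scaling $R(D) \asymp D^{-1}$ derived above.

```python
import numpy as np

def reverse_waterfill(eigenvalues, theta):
    """Evaluate one point of the parametric rate-distortion curve
    (2.5)-(2.6) for a Gaussian process with the given eigenvalues
    and water level theta."""
    lam = np.asarray(eigenvalues)
    D = np.minimum(theta, lam).sum()                       # (2.5)
    R = np.maximum(0.0, 0.5 * np.log2(lam / theta)).sum()  # (2.6)
    return D, R

# Brownian Bridge spectrum lambda_k = (pi k)^{-2}, truncated far beyond
# the water level so the tail sum in (2.5) is essentially exact.
k = np.arange(1, 10**6 + 1)
lam = (np.pi * k) ** -2.0

# As theta -> 0, R(D) should scale like 1/D, i.e., D * R ~ const.
for theta in [1e-4, 1e-5, 1e-6]:
    D, R = reverse_waterfill(lam, theta)
    print(f"theta={theta:.0e}  D={D:.3e}  R={R:.1f} bits  D*R={D*R:.3f}")
```

The product $D \cdot R$ settles near a constant as $\theta$ decreases, which is exactly the $R(D) \asymp D^{-1}$ scaling.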

III. INTERPRETATIONS OF R(D)

The compression scheme described by the $R(D)$ solution in the Gaussian case has several distinguishing features.

• Transform Coding Interpretation: Undoubtedly the most important thing to read off from the solution is that data compression can be factored into two components: a transform step followed by a coding step. The transform step takes continuous data and yields discrete sequence data; the coding step takes an initial segment of this sequence and compares it with a codebook, storing the binary representation of the best matching codeword.

• Independence of the Transformed Data: The transform step yields uncorrelated Gaussian data, hence stochastically independent data, which are, after normalization by the factors $\lambda_k^{-1/2}$, identically distributed. Hence, the apparently abstract problem of coding a process becomes closely related to the concrete problem of coding a Gaussian memoryless source, under weighted distortion measure.

• Manageable Structure of the Transform: The transform itself is mathematically well-structured. It amounts to expanding the object in the orthogonal eigenbasis associated with a self-adjoint operator, which at the abstract level is a well-understood notion. In certain important concrete cases, such as the examples we have given above, the basis even reduces to a well-known Fourier basis, and so optimal coding involves explicit harmonic analysis of the data to be encoded.

There are two universality aspects we also find remarkable:

• Universality Across Distortion Level: The structure of the ideal compression system does not depend on the distortion level; the same transform and coder structure are employed, but details of the "useful components" $k(\theta)$ change.

• Universality Across Stationary Processes: Large classes of Gaussian processes will share approximately the same coder structure, since there are large classes of covariance kernels with the same eigenstructure. For example, all stationary covariances on the circle have the same eigenfunctions, and so, there is a single "universal" transform that is appropriate for transform coding of all such processes—the Fourier transform.

At a higher level of abstraction, we remark on two further aspects of the solution.

• Dependence on Statistical Characterization: To the extent that the orthogonal transform is not universal, it nevertheless depends on the statistical characterization of the process in an easily understandable way—via the eigenstructure of the covariance kernel of the process.

• Optimality of the Basis: Since the orthobasis underlying the transform step is the Karhunen–Loève expansion, it has an optimality interpretation independently from its coding interpretation; in an appropriate ordering of the basis elements, partial reconstruction from the first $k$ components gives the best mean-squared approximation to the process available from any orthobasis.

These features of the $R(D)$ solution are so striking and so memorable that it is unavoidable to incorporate these interpretations as deep "lessons" imparted by the $R(D)$ calculation. These "lessons," reinforced by examples such as those we describe later, can harden into a "world view," creating expectations affecting data compression work in practical coding.

• Factorization: One expects to approach coding problems by compartmentalization: attempting to design a two-step system, with the first step a transform, and the second step a well-understood coder.

• Optimal Representation: One expects that the transform associated with an optimal coder will be an expansion in a kind of "optimal basis."

• Empiricism: One expects that this basis is associated with the statistical properties of the process and so, in a concrete application, one could approach coding problems empirically. The idea would be to obtain empirical instances of process data, and to accurately model the covariance kernel (dependence structure) of those instances; then one would obtain the corresponding eigenfunctions and eigenvalues, and design an empirically derived near-ideal coding system.

These expectations are inspiring and motivating. Unfortunately, there is really nothing in the Shannon theory which supports the idea that such "naive" expectations will apply outside the narrow setting in which the expectations were formed. If the process data to be compressed are not Gaussian, the $R(D)$ derivation mentioned above does not apply, and one has no right to expect these interpretations to apply.

In fact, depending on one's attraction to pessimism, it would also be possible to entertain completely opposite expectations when venturing outside the Gaussian domain. If we consider the problem of data compression of arbitrary stochastic processes, the following expectations are essentially all that one can apply.

• Lack of Factorization: One does not expect to find an ideal coding system for an arbitrary non-Gaussian process that involves transform coding, i.e., a two-part factorization into a transform step followed by a step of coding an independent sequence.

• Lack of Useful Structure: In fact, one does not expect to find any intelligible structure whatsoever in an ideal coding system for an arbitrary non-Gaussian process—beyond the minimal structure on the random codebook imposed by the problem.

For purely human reasons, it is doubtful, however, that this set of "pessimistic" expectations is very useful as a working hypothesis. The more "naive" picture, taking the $R(D)$ story for Gaussian processes as paradigmatic, leads to the following possibility: as we consider data compression in a variety of settings outside the strict confines of the original Gaussian $R(D)$ setting, we will discover that many of the expectations formed in that setting still apply and are useful, though perhaps in a form modified to take account of the broader setting. Thus, for example, we might find that factorization of data compression into two steps, one of them an orthogonal transform into a kind of optimal basis, is a property of near-ideal coders that we see outside the Gaussian case; although we might also find that the notion of optimality of the basis and the specific details of the types of bases found would have to change. We might even find that we have to replace "expansion in an optimal basis" by "expansion in an optimal decomposition," moving to a system more general than a basis.

We will see several instances below where the lessons of Gaussian $R(D)$ agree with ideal coders under this type of extended interpretation.

IV. FUNCTIONAL VIEWPOINT

In this paper we have adopted a point of view we call the functional viewpoint. Rather than thinking of data to be compressed as numerical arrays indexed by an integer variable, we think of the objects of interest as functions—functions of time or functions of space. To use terminology that Shannon would have found natural, we are considering compression of analog signals. This point of view is clearly the one in Shannon's 1948 introduction of the optimization problem underlying $R(D)$ theory, but it is less frequently seen today, since many practical compression systems start with sampled data. Practiced IT researchers will find one aspect of our discussion unnatural: we study the case where the index set stays fixed. This seems at variance even

with Shannon, who often let the domain of observation grow without bound. The fixed-domain functional viewpoint is essential for developing the themes and theoretical connections we seek to expose in this paper—it is only through this viewpoint that the connections with modern harmonic analysis become clear. Hence, we pause to explain how this viewpoint can be related to standard information theory and to practical data compression.

A practical motivation for this viewpoint can easily be proposed. In effect, when we are compressing acoustic or image phenomena, there is truly an underlying analog representation of the object, and a digitally sampled object is an approximation to it. Consider the question of an appropriate model for data compression of still-photo images. Over time, consumer camera technology will develop so that standard cameras will routinely be able to take photos with several megapixels per image. By and large, consumers using such cameras will not be changing their photographic compositions in response to the increasing quality of their equipment; they will not be expanding their field of view in picture taking, but rather they will keep the field of view constant, and so as cameras improve they will get finer and finer resolution on the same photos they would have taken anyway. So what is increasing asymptotically in this setting is the resolution of the object rather than the field of view. In such a setting, the functional point of view is sensible. There is a continuum image, and our digital data will sooner or later represent a very good approximation to such a continuum observation. Ultimately, cameras will reach a point where the question of how to compress such digital data will best be answered by knowing about the properties of an ideal system derived for continuum data.

The real reason for growing-domain assumptions in information theory is a technical one: it allows in many cases for the proof of source coding theorems, establishing the asymptotic equivalence between the "formal bit rate" $R(D)$ and the "rigorous bit rate" $\log_2 N(D)$. In our setting, this connection is obtained by considering asymptotics of both quantities as $D \to 0$. In fact, it is the $D \to 0$ setting that we focus on here, and it is under this assumption that we can show the usefulness of harmonic analysis techniques to data compression.1 This may seem at first again at variance with Shannon, who considered the distortion fixed (on a per-unit basis) and let the domain of observation grow without bound.

We have two nontechnical responses.

• The Future: With the consumer camera example in mind, high-quality compression of very large data sets may soon be of interest. So the functional viewpoint, and low-distortion coding of the data, may be very interesting settings in the near future.

• Scaling: In important situations there is a near equivalence between the "growing domain" viewpoint and the "functional viewpoint." We are thinking here of phenomena like natural images which have scaling properties: if we dilate an image, "stretching it out" to live on a growing domain, then after appropriate rescaling, we get statistical properties that are the same as the original image [37], [81]. The relevance to coding is evident, for example, in the stationary Gaussian case for the process defined in Section II, which has eigenvalues obeying an asymptotic power law, and hence which asymptotically obeys scale invariance at fine scales. Associated to a given distortion level $D$ is a characteristic "cutoff frequency" $k(\theta)$; dually, this defines a scale of approximation; to achieve that distortion level it is necessary only to know the Fourier coefficients out to frequency $k(\theta)$, or to know the samples of a bandlimited version of the object out to scale $1/k(\theta)$. This characteristic scale defines a kind of effective pixel size. As the distortion level decreases, this scale decreases, and one has many more "effective pixels." Equivalently, one could rescale the object as a function of $D$ so that the characteristic scale stays fixed, and then the effective domain of observation would grow.

1 The $D \to 0$ case is usually called the fine quantization or high-resolution limit in quantization theory; see [48].


In addition to these relatively general responses, we have a precise response: $D \to 0$ allows Source Coding Theorems. To see this, return to the setting of Section II, and the stochastic process with asymptotic power law eigenvalues given there. We propose grouping frequencies into subbands

$$B_j = \{\, k : k_j \le k < k_{j+1} \,\}.$$

The subband boundaries $(k_j)$ should be chosen in such a way that they get increasingly long with increasing $j$, but that in a relative sense, measured with respect to distance from the origin, they get increasingly narrow

$$k_{j+1} - k_j \to \infty, \qquad \frac{k_{j+1} - k_j}{k_j} \to 0. \tag{4.1}$$

The Gaussian $R(D)$ problem of Section II has the structure suggesting that one needs to code the first $k(\theta)$ coefficients in order to get a near-ideal system. Suppose we do this by dividing the first $k(\theta)$ Fourier coefficients of $X$ into subband blocks and then code the subband blocks using the appropriate coder for a block from a Gaussian independent and identically distributed (i.i.d.) source.

This makes sense. For the process we are studying, the eigenvalues decay smoothly according to a power law. The subbands are chosen so that the variances are roughly constant in subbands

$$\frac{\lambda_{k_j}}{\lambda_{k_{j+1}}} \to 1, \qquad j \to \infty. \tag{4.2}$$

Within subband blocks, we may then reasonably regard the coefficients as independent Gaussian random variables with a common variance. It is this property that would suggest to encode the coefficients in a subband using a coder for an i.i.d. Gaussian source. The problem of coding Gaussian i.i.d. data is among the most well-studied problems in information theory, and so this subband partitioning reduces the abstract problem of coding the process to a very familiar one.
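As an illustration of (4.1) and (4.2), the following sketch (ours, with the boundary choice $k_j = j^2$ as an illustrative assumption) shows blocks whose lengths grow while their relative widths shrink, and whose within-band eigenvalue ratios tend to one under a power-law spectrum.

```python
import numpy as np

def subband_boundaries(num_bands):
    """Boundaries k_j = j^2: block lengths 2j+1 grow without bound,
    while relative widths (2j+1)/j^2 shrink to zero, as in (4.1)."""
    return np.arange(1, num_bands + 2) ** 2

alpha = 2.0  # power-law eigenvalues lambda_k = k^{-alpha}
kj = subband_boundaries(20)
for j in range(len(kj) - 1):
    lo, hi = kj[j], kj[j + 1]
    lam_start, lam_end = float(lo) ** -alpha, float(hi) ** -alpha
    # (4.2): the largest/smallest eigenvalue ratio within the band -> 1,
    # so the coefficients in a band are nearly i.i.d. Gaussian.
    print(f"band {j:2d}: k in [{lo:4d},{hi:4d})  len={hi - lo:3d}  "
          f"rel width={(hi - lo) / lo:.3f}  ratio={lam_start / lam_end:.3f}")
```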

As the distortion tends to zero, the frequency cutoff $k(\theta)$ in the underlying $R(D)$ problem tends to infinity, and so the subband blocks we must code include longer and longer blocks farther and farther out in the frequency domain. These blocks behave more and more nearly like long blocks of Gaussian i.i.d. samples, and can be coded more and more precisely at the $R(D)$ rate for a Gaussian source, for example using a random codebook. An increasing fraction of all the bits allocated comes from the long blocks, where the coding is increasingly close to the $R(D)$ rate. Hence we get the asymptotic equality of "formal bits" and "rigorous bits" as $D \to 0$.

(Of course, in a practical setting, as we will discuss farther below, block coding of i.i.d. Gaussian data is impractical to "instrument;" there is no known computationally efficient way to code a block of i.i.d. Gaussians approaching the $R(D)$ limit. But in a practical case one can use known suboptimal coders for the i.i.d. problem to code the subband blocks. Note also that such suboptimal coders can perform rather well, especially in the high-rate case, since an entropy-coded uniform scalar quantizer performs within 0.255 bit/sample of the optimum.)

With the above discussion, we hope to have convinced the reader that our functional point of view, although unconventional, will shed some interesting light on at least the high-rate, low-distortion case.

V. THE ε-ENTROPY SETTING

In the mid-1950's, A. N. Kolmogorov, who had been recently exposed to Shannon's work, introduced the notion of the ε-entropy of a functional class, defined as follows. Let $T$ be a domain, and let $\mathcal{F}$ be a class of functions on that domain; suppose $\mathcal{F}$ is compact for the norm $\|\cdot\|$, so that there exists an ε-net, i.e., a finite system $\{g_1, \ldots, g_N\}$ such that

$$\sup_{f \in \mathcal{F}} \min_{1 \le i \le N} \|f - g_i\| \le \epsilon. \tag{5.1}$$

Let $N(\epsilon)$ denote the minimal cardinality of all such ε-nets. The Kolmogorov ε-entropy for $(\mathcal{F}, \|\cdot\|)$ is then

$$H_\epsilon(\mathcal{F}) = \log_2 N(\epsilon). \tag{5.2}$$

It is the least number of bits required to specify any arbitrary member of $\mathcal{F}$ to within accuracy $\epsilon$. In essence, Kolmogorov proposed a notion of data compression for classes of functions, while Shannon's theory concerned compression for stochastic processes.
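Kolmogorov's definition says nothing about how to construct a good net. As a toy illustration (entirely ours: the greedy scheme and the sampled "function class" below are illustrative assumptions, not the paper's), one can build a valid ε-net for a finite sample of smooth random functions by repeatedly adopting an uncovered element as a new net point:

```python
import numpy as np

def greedy_epsilon_net(points, eps):
    """Greedily build a valid eps-net (in the sense of (5.1), restricted
    to this finite sample); generally larger than the minimal N(eps)."""
    net, covered = [], np.zeros(len(points), dtype=bool)
    while not covered.all():
        i = int(np.argmax(~covered))  # first point not yet covered
        net.append(i)
        covered |= np.linalg.norm(points - points[i], axis=1) <= eps
    return net

# Toy "class of smooth functions": random sine series with coefficients
# decaying like k^{-2}, sampled on a grid (discrete L2 norm).
rng = np.random.default_rng(0)
k = np.arange(1, 9)
coef = rng.standard_normal((2000, k.size)) / k**2
t = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
F = (coef @ np.sin(np.outer(k, t))) * np.sqrt(2.0 / t.size)

for eps in [0.5, 0.25, 0.125]:
    net = greedy_epsilon_net(F, eps)
    print(f"eps={eps}: net size {len(net)}, ~{np.log2(len(net)):.1f} bits")
```

Halving ε grows the net, giving a crude empirical feel for $H_\epsilon = \log_2 N(\epsilon)$.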

There are some formal similarities between the problems addressed by Shannon's $R(D)$ and Kolmogorov's $H_\epsilon$. To make these clear, notice that in each case, we consider a "library of instances"—either a function class $\mathcal{F}$ or a stochastic process $X$, each case yielding as typical elements functions defined on a common domain—and we measure approximation error by the same norm.

In both the Shannon and Kolmogorov theories we encode by first constructing finite lists of representative elements—in one case, the list is called a codebook; in the other case, a net. We represent an object of interest by its closest representative in the list, and we may record simply the index into our list. The length in bits of such a recording is called in the Shannon case the rate of the codebook; in the Kolmogorov case, the entropy of the net. Our goal is to minimize the number of bits while achieving sufficient fidelity of reproduction. In the Shannon theory this is measured by mean discrepancy across random realizations; in the Kolmogorov theory this is measured by the maximum discrepancy across arbitrary members of $\mathcal{F}$. These comparisons may be summarized in the following table:

                 Shannon Theory                                   Kolmogorov Theory
Library          stochastic process $X$                           function class $\mathcal{F}$
Representers     codebook $\mathcal{C}$                           net $\{g_i\}$
Fidelity         $E \min_{c \in \mathcal{C}} \|X - c\|^2 \le D$   $\sup_{f} \min_{i} \|f - g_i\| \le \epsilon$
Complexity       rate $R(D)$                                      entropy $H_\epsilon(\mathcal{F})$

In short, the two theories are parallel—except that one of the theories postulates a library of samples arrived at by sampling a stochastic process, while the other selects arbitrary elements of a functional class.

While there are intriguing parallels between the $R(D)$ and $H_\epsilon$ concepts, the two approaches have developed separately, in very different contexts. Work on $R(D)$ has mostly stayed in the original context of communication/storage of random process data, while work with $H_\epsilon$ has mostly stayed in the context of questions in mathematical analysis: the Kolmogorov entropy numbers control the boundedness of Gaussian processes [35] and the properties of certain operators [10], [36], of convex sets [78], and of statistical estimators [6], [67].

At the general level which Kolmogorov proposed, almost nothing useful can be said about the structure of an optimal ε-net, nor is there any principle like mutual information which could be used to derive a formal expression for the cardinality of the ε-net.

However, there is a particularly interesting case in which we can say more. Consider the following typical setting for ε-entropy. Let $T = [0, 2\pi)$ be the circle and let $\mathcal{F} = \mathcal{F}_m(C)$ denote the collection of all functions $f$ such that $\|f\|_{L^2}^2 + \|f^{(m)}\|_{L^2}^2 \le C^2$. Such functions are called "differentiable in quadratic mean" or "differentiable in the sense of H. Weyl." For approximating functions of this class in $L^2$-norm, we have the asymptotics of the Kolmogorov ε-entropy [34]

$$H_\epsilon(\mathcal{F}) \asymp (C/\epsilon)^{1/m}, \qquad \epsilon \to 0. \tag{5.3}$$

A transform-domain coder can achieve this asymptotic. One divides the frequency domain into subbands $(B_j)$, defined exactly as in (4.1) and (4.2). Then one takes the Fourier coefficients of the object $f$, obtaining blocks of coefficients

$$\theta^{(j)} = (\theta_k : k \in B_j).$$

Treating these coefficients now as if they were arbitrary members of $\ell^2$ spheres of radius $r_j = \|\theta^{(j)}\|_2$, one encodes the coefficients using an $\epsilon_j$-net for the sphere of radius $r_j$. One represents the object $f$ by concatenating a prefix code together with the code for the individual subbands. The prefix code records digital approximations to the pairs $(r_j, \epsilon_j)$ for subbands, and requires asymptotically a small number of bits. The body code simply concatenates the codes for each of the individual subbands. With the right fidelity allocation—i.e., choice of the $(\epsilon_j)$—the resulting code has a length described by the right side of (5.3).

VI. THE JPEG SETTING

We now discuss the emergence of transform coding ideas in practical coders. The discrete-time setting of practical coders makes it expedient to abandon the functional point of view throughout this section, in favor of a viewpoint based on sampled data.

A. History

Transform coding plays an important role for image and audio compression, where several successful standards incorporate linear transforms. The success and wide acceptance of transform coding in practice is due to a combination of factors. The Karhunen–Loève transform and its optimality under some (restrictive) conditions form a theoretical basis for transform coding. The wide use of particular transforms like the discrete cosine transform (DCT) led to a large body of experience, including the design of quantizers with human perceptual criteria. But most importantly, transform coding using a unitary matrix having a fast algorithm represents an excellent compromise in terms of computational complexity versus coding performance. That is, for a given cost (number of operations, run time, silicon area), transform coding outperforms more sophisticated schemes by a margin.

The idea of compressing stochastic processes using a linear transformation dates back to the 1950's [63], when signals originating from a vocoder were shown to be compressible by a transformation made up of the eigenvectors of the correlation matrix. This is probably the earliest use of the Karhunen–Loève transform (KLT) in data compression. Then, in 1963, Huang and Schultheiss [55] did a detailed analysis of block quantization of random variables, including bit allocation. This forms the foundation of transform coding as used in signal compression practice. The approximation of the KLT by trigonometric transforms, especially structured transforms allowing a fast algorithm, was done by a number of authors, leading to the proposal of the discrete cosine transform in 1974 [1]. The combination of discrete cosine transform, scalar quantization, and entropy coding was studied in detail for image compression, and then standardized in the late 1980's by the Joint Photographic Experts Group (JPEG), leading to the JPEG image compression standard that is now widely used. In the meantime, another generation of image coders, mostly based on wavelet decompositions and elaborate quantization and entropy coding, are being considered for the next standard, called JPEG-2000.


B. The Standard Model and the Karhunen–Loève Transform

The structural facts described in Section II, concerning $R(D)$ for Gaussian random processes, become very simple in the case of Gaussian random vectors. Compression of a vector of correlated Gaussian random variables factors into a linear transform followed by independent compression of the transform coefficients.

Consider a size-$N$ vector $x$ of zero-mean random variables $x_0, x_1, \ldots, x_{N-1}$, and the vector $y$ of random variables after transformation by $T$, or $y = Tx$. Define $R_x = E[xx^T]$ and $R_y = E[yy^T]$ as autocovariance matrices of $x$ and $y$, respectively. Since $R_x$ is symmetric and positive-semidefinite, there is a full set of orthogonal eigenvectors with nonnegative eigenvalues. The Karhunen–Loève transform matrix $T_{KLT}$ is defined as the matrix of unit-norm eigenvectors of $R_x$ ordered in terms of decreasing eigenvalues, that is,

$$T_{KLT} = [v_0 \ v_1 \ \cdots \ v_{N-1}]^T$$

where $R_x v_i = \lambda_i v_i$ with $\lambda_0 \ge \lambda_1 \ge \cdots \ge \lambda_{N-1} \ge 0$ (for simplicity, we will assume that the eigenvalues are distinct). Clearly, transforming $x$ with $T_{KLT}$ will diagonalize the autocovariance:

$$R_y = T_{KLT} R_x T_{KLT}^T = \mathrm{diag}(\lambda_0, \lambda_1, \ldots, \lambda_{N-1}).$$

The KLT satisfies a best linear approximation property in the mean-squared error sense, which follows from the eigenvector choices in the transform. That is, if only a fixed subset of the transform coefficients is kept, then the best transform is the KLT.

The importance of the KLT for compression comes from the following standard result from source coding [46]. A size-$N$ Gaussian vector source with correlation matrix $R_x$ and mean zero is to be coded with a linear transform. Bits are allocated optimally to the transform coefficients (using reverse waterfilling). Then the transform that minimizes the MSE in the limit of fine quantization of the transform coefficients is the Karhunen–Loève transform $T_{KLT}$. The coding gain due to optimal transform coding over straight PCM coding is

$$G_{TC} = \frac{\frac{1}{N} \sum_{i=0}^{N-1} \sigma_{y_i}^2}{\left( \prod_{i=0}^{N-1} \sigma_{y_i}^2 \right)^{1/N}} \tag{6.1}$$

where we used $\sigma_{x_i}^2 = \sigma_x^2$, $i = 0, \ldots, N-1$. Recalling that the variances $\sigma_{y_i}^2$ are the eigenvalues of $R_x$, it follows that the coding gain is the ratio of the arithmetic and geometric means of the eigenvalues of the autocorrelation matrix.
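Here is a short sketch (ours, in NumPy; the helper names are our own) computing the KLT of a first-order Gauss–Markov autocovariance and the coding gain (6.1); it checks that $T_{KLT} R_x T_{KLT}^T$ is diagonal and that the gain equals the arithmetic/geometric mean ratio of the eigenvalues.

```python
import numpy as np

def klt(R):
    """KLT of a covariance matrix R: rows are unit-norm eigenvectors
    of R, ordered by decreasing eigenvalue."""
    eigval, eigvec = np.linalg.eigh(R)   # ascending order
    order = np.argsort(eigval)[::-1]
    return eigvec[:, order].T, eigval[order]

def coding_gain(lam):
    """Coding gain (6.1): arithmetic mean / geometric mean of the
    eigenvalues of the autocovariance matrix."""
    lam = np.asarray(lam)
    return lam.mean() / np.exp(np.log(lam).mean())

# First-order Gauss-Markov (AR(1)) autocovariance: R[i,j] = rho^|i-j|.
N, rho = 8, 0.95
R_x = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))

T, lam = klt(R_x)
R_y = T @ R_x @ T.T
print("off-diagonal energy of R_y:",
      np.sum((R_y - np.diag(np.diag(R_y))) ** 2))  # ~ 0
G = coding_gain(lam)
print(f"coding gain G_TC = {G:.2f} ({10 * np.log10(G):.2f} dB)")
```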

Using reverse waterfilling, the above construction can be used to derive the $D(R)$ function of i.i.d. Gaussian vectors [19]. However, an important point is that in a practical setting and for complexity reasons, only scalar quantization is used on the transform coefficients (see Fig. 1). The high-rate scalar distortion rate function (with entropy coding) for i.i.d. Gaussian samples of variance $\sigma^2$ is given by

$$D_h(R) = \frac{\pi e}{6}\, \sigma^2\, 2^{-2R}$$

while the Shannon distortion rate function is $D(R) = \sigma^2\, 2^{-2R}$ (using block coding). This means that a penalty of about a quarter bit per sample is paid, a small price at high rates or small distortions.

Fig. 1. Transform coding system, where $T$ is a unitary transform, $Q_i$ are scalar quantizers, and $E_i$ are entropy coders.
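A one-line numeric check (ours) of the quarter-bit penalty: equating the two distortion-rate expressions above, the entropy-coded scalar quantizer needs an extra $\frac{1}{2}\log_2(\pi e / 6)$ bits per sample relative to the Shannon bound.

```python
import numpy as np

# Rate gap between D_h(R) = (pi e / 6) sigma^2 2^{-2R} and
# D(R) = sigma^2 2^{-2R} at equal distortion:
penalty = 0.5 * np.log2(np.pi * np.e / 6.0)
print(f"scalar quantization penalty = {penalty:.4f} bit/sample")  # ~0.2546
```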

C. The Discrete Cosine Transform

To make the KLT approach to block coding operationalrequires additional steps. Two problems need to be addressed:the signal dependence of the KLT (finding eigenvectors ofthe correlation matrix), and the complexity of computing theKLT ( operations). Thus fast fixed transforms (with about

operations) leading to approximate diagonalizationof correlation matrices are used. The most popular amongthese transforms is the discrete cosine transform, which hasthe property that it diagonalizes approximately the correla-tion matrix of a first-order Gauss–Markov process with highcorrelation , and also the correlation matrix of anarbitrary Gauss–Markov process (with correlation of sufficientdecay, ) and block sizes TheDCT is closely related to the discrete Fourier transform, andthus can be computed with a fast Fourier transform likealgorithm in operations. This is a key issue: theDCT achieves a good compromise between coding gain orcompression, and computational complexity. Therefore, for agiven computational budget, it can actually outperform theKLT [49].
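The approximate diagonalization is easy to observe numerically. The sketch below (ours) builds the orthonormal DCT-II matrix directly from its definition (no fast algorithm needed at this size) and measures the fraction of the energy of $C R_x C^T$ left off the diagonal as $\rho$ grows:

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II: C[k, n] = c_k cos(pi k (2n + 1) / (2N))."""
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    C = np.sqrt(2.0 / N) * np.cos(np.pi * k * (2 * n + 1) / (2 * N))
    C[0, :] /= np.sqrt(2.0)
    return C

def offdiag_fraction(A):
    """Fraction of matrix energy lying off the diagonal."""
    off = A - np.diag(np.diag(A))
    return np.sum(off ** 2) / np.sum(A ** 2)

N = 8
C = dct_matrix(N)
for rho in [0.5, 0.9, 0.99]:
    R_x = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
    print(f"rho={rho}: off-diagonal energy fraction after DCT = "
          f"{offdiag_fraction(C @ R_x @ C.T):.4f}")
```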

VII. THE COMMON GAUSSIAN MODEL

At this point we have looked at three different settings in which we can interpret the phrase "data compression." In each case we have available a library of instances which we would like to represent with few bits.

• a) In Section II (on $R(D)$ theory) we are considering the instances to be realizations of Gaussian processes. The library is the collection of all such realizations.

• b) In Section V (on ε-entropy) we are considering the instances to be smooth functions. The library is the collection of such smooth functions obeying the constraint $\|f\|_{L^2}^2 + \|f^{(m)}\|_{L^2}^2 \le C^2$.

• c) In Section VI (on JPEG), we are considering the instances to be existing or future digital images. The library is implicitly the collection of all images of potential interest to the JPEG user population.

We can see a clear similarity in the coding strategy used in each setting.

• Transform into the frequency domain.

• Break the transform into homogeneous subbands.

• Apply simple coding schemes to the subbands.

Why the common strategy of transform coding?

The theoretically tightest motivation for transform coding comes from Shannon's $R(D)$ theory, which tells us that in order to best encode a Gaussian process, one should transform the process realization to the Karhunen–Loève domain, where the resulting coordinates are independent random variables. This sequence of independent Gaussian variables can be coded by traditional schemes for coding discrete memoryless sources.

So when we use transform coding in another setting—ε-entropy or JPEG—it appears that we are behaving as if that setting could be modeled by a Gaussian process.

In fact, it is sometimes said that the JPEG scheme is appropriate for real image data because if image data were first-order Gauss–Markov, then the DCT would be approximately the Karhunen–Loève transform, and so JPEG would be approximately following the script of the $R(D)$ story. Implicitly, the next statement is "and real image data behave something like first-order Gauss–Markov."

What about ε-entropy? In that setting there is no obvious "randomness," so it would seem unclear how a connection with Gaussian processes could arise. In fact, a proof of (5.3) can be developed by exhibiting just such a connection [34]; one can show that there are Gaussian random functions whose sample realizations obey, with high probability, the smoothness constraint defining the class, and for which the $R(D)$ theory of Shannon accurately matches the number of bits required, in the Kolmogorov theory, to represent such functions within a distortion level $\epsilon$ (i.e., the right side of (5.3)). The Gaussian process with this property is a process naturally associated with the class obeying the indicated smoothness constraint—the least favorable process for Shannon data-compression; a successful Kolmogorov ε-net for the process will be effectively a successful Shannon codebook for the least favorable process. So even in the Kolmogorov case, transform coding can be motivated by recourse to $R(D)$ theory for Gaussian processes, and to the idea that the situation can be modeled as a Gaussian one. (That a method derived from Gaussian assumptions helps us in other cases may seem curious. This is linked to the fact that the Gaussian appears as a worst case scenario. Handling the worst case well will often lead to adequate, if not optimal, performance for more favorable cases.)

VIII. GETTING THE MODEL RIGHT

We have so far considered only a few settings in which data compression could be of interest. In the context of $R(D)$ theory, we could be interested in complex non-Gaussian processes; in $H_\epsilon$ theory we could be interested in functional classes defined by norms other than the $L^2$ norm; in image coding we could be interested in particular image compression tasks, say specialized to medical imagery, or to satellite imagery.

Certainly, the simple idea of Fourier transform followed by block i.i.d. Gaussian coding cannot be universally appropriate. As the assumptions about the collection of instances to be represented change, presumably the corresponding optimal representation will change. Hence it is important to explore a range of modeling assumptions and to attempt to get the assumptions right! Although Shannon's ideas have been very important in supporting the diffusion of frequency-domain coding in practical lossy compression, we feel that he himself would have been the first to suggest a careful examination of the underlying assumptions, and to urge the formulation of better assumptions. (See, for instance, his exhortations in [85].)

In this section we consider a wider range of models for the libraries of instances to be compressed, and see how alternative representations emerge as useful.

A. Some Non-Gaussian Models

Over the last decade, studies of the statistics of natural images have repeatedly shown the non-Gaussian character of image data. While images make up only one application area for data compression, the evidence is quite interesting.

Empirical studies of wavelet transforms of images, considering histograms of coefficients falling in a common subband, have uncovered markedly non-Gaussian structure. As noted by many people, subband histograms are consistent with probability densities having the form $\propto \exp(-|u|^\alpha)$, where the exponent $\alpha$ would be $2$ if the Gaussian case applied, but where one finds radically different values of $\alpha$ in practice; e.g., Simoncelli [87] reports evidence for values of $\alpha$ well below the Gaussian exponent. In fact, such generalized Gaussian models have long been used to model subband coefficients in the compression literature (e.g., [101]). Field [37] investigated the fourth-order cumulant structure of images and showed that it was significantly nonzero. This is far out of line with the Gaussian model, in which all cumulants of order three and higher vanish.

In later work, Field [38] proposed that wavelet transforms of images offered probability distributions which were "sparse." A simple probability density with such a sparse character is the Gaussian scale mixture

$$p(u) = (1 - \epsilon)\, \frac{1}{\delta}\, \phi\!\left(\frac{u}{\delta}\right) + \epsilon\, \phi(u)$$

where $\phi$ is a Gaussian density and $\epsilon$ and $\delta$ are both small positive numbers; this corresponds to data being of one of two "types": "small," the vast majority, and "large," the remaining few. It is not hard to understand where the two types come from: a wavelet coefficient can be localized to a small region which contains an edge, or which does not contain an edge. If there is no edge in the region, it will be "small;" if there is an edge, it will be "large."
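A quick simulation (ours; the mixture parameters below are illustrative assumptions) shows how strongly such a two-type mixture departs from Gaussianity in its fourth-order statistics:

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps, delta = 10**6, 0.05, 0.1  # 5% "large" type; "small" type has sd 0.1

# Two-type mixture: mostly small-variance draws, occasionally a
# unit-variance ("edge") draw.
is_large = rng.random(n) < eps
u = np.where(is_large, rng.standard_normal(n), delta * rng.standard_normal(n))

m2, m4 = np.mean(u**2), np.mean(u**4)
print(f"excess kurtosis = {m4 / m2**2 - 3.0:.1f}   (any Gaussian: 0)")
```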

Stationary Gaussian models are very limited and are unable to duplicate these empirical phenomena. Images are best thought of as spatially stationary stochastic processes, since logically the position of an object in an image is rather arbitrary, and a shift of that object to another position would produce another equally valid image. But if we impose stationarity on a Gaussian process we cannot really exhibit both edges and smooth areas. A stationary Gaussian process must exhibit a great deal of spatial homogeneity. From results in the mean-square calculus we know that if such a process is mean-square-continuous at a point, it is mean-square-continuous at every point. Clearly, most images will not fit this model adequately.

Conditionally Gaussian models offer an attractive way to maintain ties with the Gaussian case while exhibiting globally non-Gaussian behavior. In such models, image formation takes place in two stages. An initial random experiment lays down regions separated by edges, and then in a subsequent stage each region is assigned a Gaussian random field.

Consider a simple model of random piecewise-smooth functions in dimension one, where the piece boundaries are thrown down at random, say by a Poisson process, the pieces are realizations of (different) Gaussian processes (possibly stationary), and discontinuities are allowed across the boundaries of the pieces [12]. This simple one-dimensional model can replicate some of the known empirical structure of images, particularly the sparse histogram structure of wavelet subbands and the nonzero fourth-order cumulant structure.
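A minimal simulation sketch of this one-dimensional model (ours; the Poisson intensity, the sample count, and the use of random low-order trigonometric polynomials as stand-ins for smooth Gaussian-process pieces are all arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 2048, endpoint=False)

# Piece boundaries thrown down by a Poisson process on [0, 1].
breaks = np.sort(rng.uniform(0.0, 1.0, rng.poisson(8.0)))
edges = np.concatenate(([0.0], breaks, [1.0]))

# Each piece: an independent smooth random function; jumps occur at the edges.
x = np.zeros_like(t)
for a, b in zip(edges[:-1], edges[1:]):
    idx = (t >= a) & (t < b)
    s = (t[idx] - a) / max(b - a, 1e-9)                 # local coordinate in [0, 1)
    coef = rng.standard_normal(4)
    x[idx] = sum(c * np.cos(np.pi * k * s) / (k + 1.0)  # low order, hence smooth
                 for k, c in enumerate(coef))
```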

B. Adaptation, Resource Allocation, and Nonlinearity

Unfortunately, when we leave the domain of Gaussian models, we lose the ability to compute $R(D)$ in such great generality. Instead, we begin to operate heuristically. Suppose, for example, we employ a conditionally Gaussian model. There is no general solution for $R(D)$ for such a class; but it seems reasonable that the two-stage structure of the model gives clues about optimal coding; accordingly, one might suppose that an effective coder should factor into a part that adapts to the apparent segmentation structure of the image and a part that acts in a traditional way conditional on that structure. In the simple model of piecewise-smooth functions in dimension one, it is clear that coding in long blocks is useful for the pieces, and that the coding must be adapted to the characteristics of the pieces. However, discontinuities must be well-represented also. So it seems natural that one attempts to identify an empirically accurate segmentation and then adaptively code the pieces. If transform coding ideas are useful in this setting, it might seem that they would play a role subordinate to the partitioning—i.e., appearing only in the coding of individual pieces. It might seem that applying a single global orthogonal transform to the data is simply not compatible with the assumed two-stage structure.

Actually, transform coding is able to offer a degree of adaptation to the presence of a segmentation. The wavelet transform of an object with discontinuities will exhibit large coefficients in the neighborhood of discontinuities, and, at finer scales, will exhibit small coefficients away from discontinuities. If one designs a coder which does well in representing such “sparse” coefficient sequences, it will attempt to represent all the coefficients at coarser scales, while allocating bits to represent only those few big coefficients at finer scales. Implicitly, coefficients at coarser scales represent the structure of pieces, and coefficients at finer scales represent discontinuities between the pieces. The resource allocation is therefore achieving some of the same effect as an explicit two-stage approach.

Hence, adaptivity to the segmentation can come from applying a fixed orthogonal transform together with adaptive resource allocation of the coder. Practical coding experience supports this. Traditional transform coding of i.i.d. Gaussian random vectors at high rate assumes a fixed rate allocation per symbol, but practical coders, because they work at low rate and use entropy coding, typically adapt the coding rate to the characteristics of each block. Specific adaptation mechanisms, using context modeling and implicit or explicit side information, are also possible.

Adaptive resource allocation with a fixed orthogonal transform is closely connected with a mathematical procedure which we will explore at length in Sections XIV and XV: nonlinear approximation using a fixed orthogonal basis. Suppose that we have an orthogonal basis and we wish to approximate an object using only $n$ basis functions. In traditional linear approximation, we would consider using the first $n$ basis functions to form such an approximation. In nonlinear approximation, we would consider using the best $n$ basis functions, i.e., to adaptively select the $n$ terms which offer the best approximation to the particular object being considered. This adaptation is a form of resource allocation, where the resources are the terms to be used. Because of this connection, we will begin to refer to “the nonlinear nature of the approximation process” offered by practical coders.
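To make the first-$n$ versus best-$n$ distinction concrete, here is a small sketch (ours; the signal, the DCT basis via SciPy, and the sizes are arbitrary choices) for an object whose large coefficients are scattered in frequency:

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(1)
n = 1024
theta_true = np.zeros(n)
pos = rng.choice(n, 12, replace=False)        # 12 scattered frequencies
theta_true[pos] = 10.0 * rng.standard_normal(12)
x = idct(theta_true, norm='ortho')            # the object to approximate

theta = dct(x, norm='ortho')
err = lambda th: np.linalg.norm(idct(th, norm='ortho') - x)

for k in (12, 64):
    linear = theta.copy(); linear[k:] = 0.0   # linear: keep the FIRST k terms
    nonlin = np.zeros(n)
    top = np.argsort(np.abs(theta))[-k:]
    nonlin[top] = theta[top]                  # nonlinear: keep the BEST k terms
    print(f"k = {k:3d}: linear error = {err(linear):9.4f}, "
          f"nonlinear error = {err(nonlin):.2e}")
```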

C. Variations on Stochastic Process Models

To bring home the remarks of the last two subsections, we consider some specific variations on the stochastic process models of Section II. In these variations, we will consider processes that are non-Gaussian, and we will compare useful coding strategies for those processes with the coding strategies for the Gaussian processes having the same second-order statistics.

• Spike Process. For this example, we briefly leave the functional viewpoint.

Consider the following simple discrete-time random process, generating a single “spike.” Let
$$X_t = Z\,\mathbf{1}_{\{t = T\}}, \qquad t = 1, \ldots, N,$$
where $T$ is uniformly distributed between $1$ and $N$ and $Z$ is $N(0,1)$. That is, after picking a random location, one puts a Gaussian random variable at that location. The autocorrelation is equal to $E[X_s X_t] = N^{-1}\,\delta_{s,t}$, thus the KLT is the identity transformation. Allocating $R/N$ bits to each coefficient leads to a distortion of order $2^{-2R/N}$ for the single nonzero coefficient. Hence the distortion–rate function describing the operational performance of the Gaussian codebook coder in the KLT domain has
$$D_{\mathrm{KLT}}(R) \sim c\,2^{-2R/N}.$$
Here the constant $c$ depends on the quantization and coding of the transform coefficients.

Fig. 2. (a) Realization of Ramp. (b) Wavelet and (c) DCT coefficients. (d) Rearranged coefficients. (e) Nonlinear approximation errors (14.2). (f) Operating performance curves of scalar quantization coding.

An obvious alternate scheme at high rates is to spend $\log_2 N$ bits to address the nonzero coefficient, and use the remaining $R - \log_2 N$ bits to represent the Gaussian variable $Z$. This position-indexing method leads to
$$D_{\mathrm{pos}}(R) \sim c\,2^{-2(R - \log_2 N)}.$$
This coder relies heavily on the non-Gaussian character of the joint distribution of the entries in $X$, and for large enough $R$ this non-Gaussian coding clearly outperforms the former, Gaussian approximation method (see the numerical sketch at the end of this subsection). While this is clearly a very artificial example, it makes the point that if location (or phase) is critical, then time-invariant, linear methods like the KLT followed by independent coding of the transform coefficients are suboptimal.

Images are very phase-critical: edges are among the most visually significant features of images, and thus efficient position coding is of the essence. So it comes as no surprise that some nonlinear approximation ideas made it into standards, namely, ideas where addressing of large coefficients is efficiently solved.

• Ramp Process. Yves Meyer [75] proposed the following model. We have a process defined on $[0,1)$ through a single random variable $\tau$ uniformly distributed on $[0,1)$ by
$$X_t = t - \mathbf{1}_{\{t \ge \tau\}}.$$

This is a very simple process, and very easy to code accurately. A reasonable coding scheme would be to extract $\tau$ by locating the jump of the process and then quantizing it to the required fidelity.

On the other hand, Ramp is covariance-equivalent to the Brownian Bridge process $B_t$ which we mentioned already in Section II, the Gaussian zero-mean process on $[0,1)$ with covariance
$$\operatorname{Cov}(B_s, B_t) = \min(s,t) - st.$$

An asymptotically $R(D)$-optimal approach to coding Brownian Bridge can be based on the Karhunen–Loève transform; as we have seen, in this case the sine transform. One takes the sine transform of the realization, breaks the sequence of coefficients into subbands obeying (4.1) and (4.2), and then treats the coefficients, in subbands, exactly as in discrete memoryless source compression.

Suppose we ignored the non-Gaussian character of the Ramp process and simply applied the same coder we would use for Brownian Bridge. After all, the two are covariance-equivalent. This would result in orders of magnitude more bits than necessary. The coefficients in the sine transform of Ramp are random; their typical size is measured in mean square by the eigenvalues of the covariance—namely, $(\pi k)^{-2}$. In order to accurately represent the Ramp process with distortion $\epsilon$, we must code the first $K$ coefficients, with $K$ growing like $1/\epsilon$, at rates exceeding 1 bit per coefficient. How many coefficients does it take to represent a typical realization of Ramp with a relative error of 1%? On the order of tens of thousands.

On the other hand, as Meyer pointed out, the wavelet coefficients of Ramp decay very rapidly, essentially exponentially. As a result, very simple scalar quantization schemes based on wavelet coefficients can capture realizations of Ramp with 1% accuracy using a few dozen rather than tens of thousands of coefficients, and with a corresponding advantage at the level of bits; this is illustrated in Fig. 2, and in the numerical sketch at the end of this subsection.

The point here is that if we pay attention to second-order statistics only, and adopt an approach that would be good under a Gaussian model, we may pay orders of magnitude more bits than would be necessary for coding the process under a more appropriate model. By abandoning the Karhunen–Loève transform in this non-Gaussian case, we get a transform in which very simple scalar quantization works very well.

Note that we could in principle build a near-optimal scheme by transform coding with a coder based on Fourier coefficients, but we would have to apply a much more complex quantizer; it would have to be a vector quantizer. (Owing to the Littlewood–Paley theory described later, it is possible to say what the quantizer would look like; it would involve quantizing coefficients near wavenumber $k$ in blocks of size roughly $k$. This is computationally impractical.)
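As promised, numerical sketches of the two examples above (ours, not from the original analysis). For the spike process, we simply evaluate the two reconstructed distortion–rate expressions, with all constants set to one, and locate the rate beyond which position indexing wins:

```python
import numpy as np

N = 1024                                   # block length of the spike process
R = np.arange(1, 2001, dtype=float)        # total bit budget

D_klt = 2.0 ** (-2.0 * R / N)              # R/N bits for each of N coefficients
D_pos = np.where(R > np.log2(N),           # log2(N) bits address the spike ...
                 2.0 ** (-2.0 * (R - np.log2(N))),  # ... the rest quantize Z
                 1.0)                      # below that, no meaningful coding

crossover = R[np.argmax(D_pos < D_klt)]
print(f"position indexing wins for R >= {crossover:.0f} bits "
      f"(log2 N = {np.log2(N):.0f})")
```

For the Ramp process, the following fragment (the helpers `haar` and `terms_for_relative_error` are ours; the jump location and sample size are arbitrary) counts how many of the largest-magnitude coefficients are needed to capture a sampled realization to 1% relative $L^2$ error in an orthonormal sine basis versus an orthonormal Haar basis—by Parseval, the error of dropping coefficients is the norm of the dropped ones:

```python
import numpy as np
from scipy.fft import dst

def haar(x):
    """Orthonormal discrete Haar transform; len(x) must be a power of two."""
    x = x.astype(float).copy()
    details, n = [], len(x)
    while n > 1:
        a = (x[0:n:2] + x[1:n:2]) / np.sqrt(2.0)   # local averages
        d = (x[0:n:2] - x[1:n:2]) / np.sqrt(2.0)   # local differences
        details.append(d)
        x[:n // 2], n = a, n // 2
    return np.concatenate([x[:1]] + details[::-1])  # coarse first, fine last

def terms_for_relative_error(c, tol=0.01):
    """Fewest largest-magnitude coefficients leaving relative error <= tol."""
    mags = np.sort(np.abs(c))[::-1]
    tail = np.sqrt(np.cumsum(mags[::-1] ** 2))[::-1]  # tail[k] = ||dropped||
    return int(np.argmax(tail / np.linalg.norm(c) <= tol))

n = 2 ** 14
t = (np.arange(n) + 0.5) / n
x = t - (t >= 0.37)                         # one Ramp realization (tau = 0.37)

print("sine coefficients needed:", terms_for_relative_error(dst(x, norm='ortho')))
print("Haar coefficients needed:", terms_for_relative_error(haar(x)))
```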

D. Variations on Function Class Models

In Section V, we saw that subband coding of Fourier coefficients offered an essentially optimal method, under the Kolmogorov $\epsilon$-entropy model, of coding objects known a priori to obey smoothness constraints of $L^2$ type.

While this may not be apparent to outsiders, there are major differences in the implications of various smoothness constraints. Suppose we maintain the $L^2$ distortion measure, but make the seemingly minor change from the $L^2$ form of constraint to an $L^p$ form, with $p \neq 2$.

This can cause major changes in what constitutes an underlying optimal strategy. Rather than transform coding in the frequency domain, we can find that transform coding in the wavelet domain is appropriate.

Bounded Variation Model: As a simple example, consider the model that the object under consideration is a function $f$ of a single variable that is of bounded variation. Such functions can be interpreted as having derivatives which are signed measures, and then we measure the norm by the total mass of that measure,
$$\|f\|_{BV} = \int |f'(t)|\,dt,$$
interpreted appropriately when $f'$ is a measure. The important point is that such $f$ can have jump discontinuities, as long as the sum of the jumps is finite. Hence, the class of functions of bounded variation can be viewed as a model for functions which have discontinuities; for example, a scan line in a digital image can be modeled as a typical BV function.

An interesting fact about BV functions is that they can be essentially characterized by their Haar coefficients. The BV functions with norm $\le C$ obey an inequality
$$\sup_j\; 2^{j/2} \sum_k |\alpha_{j,k}| \le C',$$
where the $\alpha_{j,k}$ are the Haar wavelet expansion coefficients at scale $2^{-j}$ and position $k\,2^{-j}$. It is almost the case that every function that obeys this constraint is a BV function. This says that, geometrically, the class of BV functions with norm $\le C$ is a convex set inscribed in a family of $\ell^1$ balls.

An easy coder for functions of bounded variation can be based on scalar quantization of Haar coefficients. However, scalar quantization of the Fourier coefficients would not work nearly as well; as the desired distortion $\epsilon \to 0$, the number of bits for Fourier/scalar quantization coding can be orders of magnitude worse than the number of bits for wavelet/scalar quantization coding. This follows from results in Sections XV and XVI below.

E. Variations on Transform Coding and JPEG

When we consider transform coding as applied to empirical data, we typically find that a number of simple variations can lead to significant improvements over what the strict Gaussian $R(D)$ theory would predict. In particular, we see that when going from theory to practice, the KLT as implemented in JPEG becomes nonlinear approximation!

The image is first subdivided into blocks of size $N$ by $N$ ($N$ is typically equal to $8$ or $16$) and these blocks are treated independently. Note that blocking the image into independent pieces allows the compression to be adapted to each block individually. An orthonormal basis for the two-dimensional blocks is derived as a product basis from the one-dimensional DCT. While not necessarily best, this is an efficient way to generate a two-dimensional basis.

Now, quantization and entropy coding is done in a manner that is quite at variance with the classical setup. First, based on perceptual criteria, the transform coefficient $c_{k_1,k_2}$ is quantized with a uniform quantizer of stepsize $\Delta_{k_1,k_2}$. Typically, $\Delta_{k_1,k_2}$ is small for low frequencies, and large for high ones, and these stepsizes are stored in a quantization matrix $Q$. Technically, one could pick different quantization matrices for different blocks in order to adapt, but usually, only a single scale factor $s$ is used to multiply $Q$, and this scale factor can be adapted depending on the statistics in the block. Thus the approximate representation of the $(k_1,k_2)$th coefficient is
$$\hat c_{k_1,k_2} = \Delta_{k_1,k_2}\, q_{k_1,k_2}, \qquad \text{where } q_{k_1,k_2} = \operatorname{round}(c_{k_1,k_2}/\Delta_{k_1,k_2}).$$
The quantized variable $q_{k_1,k_2}$ is discrete with a finite number of possible values ($c_{k_1,k_2}$ is bounded) and is entropy-coded.

Since there is no natural ordering of the two-dimensional DCT plane, yet known efficient entropy coding techniques work on one-dimensional sequences of coefficients, a prescribed 2D-to-1D scanning is used. This so-called “zig-zag” scan traverses the DCT frequency plane diagonally from low to high frequencies. For this resulting one-dimensional length-$N^2$ sequence, nonzero coefficients are entropy-coded, and stretches of zero coefficients are encoded using entropy coding of run lengths. An end-of-block (EOB) symbol terminates a sequence of DCT coefficients when only zeros are left (which is likely to arrive early in the sequence when coarse quantization is used).

Let us consider two extreme modes of operation. In the first case, assume very fine quantization. Then, many coefficients will be nonzero, and the behavior of the rate–distortion tradeoff is dominated by the quantization and entropy coding of the individual coefficients; that is, $D(R) \sim c\,2^{-2R}$. This mode is also typical for high-variance regions, like textures.

Fig. 3. Performance of real transform coding systems. The logarithm of the MSE is shown for JPEG (top) and SPIHT (bottom). Above about 0.5 bit/pixel, there is the typical −6 dB per bit slope, while at very low bit rates, a much steeper slope is achieved.

In the second case, assume very coarse quantization. Then, many coefficients will be zero, and the run-length coding is an efficient indexing of the few nonzero coefficients. We are in a nonlinear approximation case, since the image block is approximated with a few basis vectors corresponding to large inner products. Then, the behavior is very different, dominated by the faster decay of ordered transform coefficients, which in turn is related to the smoothness class of the images. Such behavior is also typical for structured regions, like smooth surfaces cut by edges, since the DCT coefficients will be sparse.

These two different behaviors can be seen in Fig. 3, where the logarithm of the distortion versus the bit rate per pixel is shown. The −6 dB per bit slope above about 0.5 bit/pixel is clear, as is the steeper slope below. An analysis of the low-rate behavior of transform codes has been done recently by Mallat and Falzon [70]; see also related work in Cohen, Daubechies, Guleryuz, and Orchard [14].

IX. GOOD MODELS FOR NATURAL DATA?

We have now seen, by considering a range of different intellectual models for the class of objects of interest to us, that depending on the model we adopt, we can arrive at very different conclusions about the “best” way to represent or compress those objects. We have seen that what seems like a good method in one model can be a relatively poor method according to another model, and that existing models used in data compression are relatively poor descriptions of the phenomena we see in natural data. We think that we may still be far away from achieving an optimal representation of such data.

A. How Many Bits for Mona Lisa?

A classic question, somewhat tongue in cheek, is: how many bits do we need to describe the Mona Lisa? JPEG uses 187 Kbytes in one version. From many points of view, this is far more than the number intrinsically required.

Humans will recognize a version based on a few hundred bits. An early experiment by L. Harmon of Bell Laboratories shows a recognizable Abraham Lincoln at 756 bits, a trick also used by S. Dali in his painting “Slave Market with Invisible Bust of Voltaire.”

Another way to estimate the number of bits in a representation is to consider an index of every photograph ever taken in the history of mankind. With a generous estimate of 100 billion pictures a year, the 100 years of photography need an index of about 44 bits. Yet another possibility is to index all pictures that can possibly be viewed by all humans. Given the world population, and the fact that at most 25 pictures a second are recognizable, a hundred years of viewing is indexed in about 69 bits.
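For the record, the arithmetic behind these two counts (our reconstruction, taking the world population of the time to be roughly $6 \times 10^9$ and a century to be about $3.15 \times 10^9$ seconds) is:
\[
10^{11}\ \tfrac{\text{pictures}}{\text{year}} \times 10^{2}\ \text{years} = 10^{13} \approx 2^{43.2}
\;\Longrightarrow\; \text{about } 44 \text{ bits};
\]
\[
6 \times 10^{9}\ \text{viewers} \times 25\ \tfrac{\text{pictures}}{\text{second}} \times 3.15 \times 10^{9}\ \text{seconds}
\approx 4.7 \times 10^{20} \approx 2^{68.7}
\;\Longrightarrow\; \text{about } 69 \text{ bits}.
\]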

Given that the Mona Lisa is a very famous painting, it is clear that probably a few bits will be enough (with the obvious variable-length code: [is it Lena?, is it Mona Lisa?, etc.]). Another approach is the interactive search for the image, for example, on the Web. A search engine prompted with a few keywords will quickly come back with the answer at the top of the following page, and just a few bytes have been exchanged.

These numbers are all very suggestive when we consider estimates of the information rate of the human visual system.


Barlow [4] summarizes evidence that the many layers of processing in the human visual system reduce the information flow from several megabits per second at the retina to about 40 bits per second deep in the visual pathway.

From all points of view, images ought to be far more compressible than current compression standards allow.

B. The Grand Challenge

An effort to do far better compression leads to the Grand Challenge, items GC1–GC3 of Section I. However, to address this challenge by orthodox application of the Shannon theory seems to us hopeless. To understand why, we make three observations.

• Intrinsic Complexity of Natural Data Sources: An accurate model for empirical phenomena would be of potentially overwhelming complexity. In effect, images, or sounds, even in a restricted area of application like medical imagery, are naturally infinite-dimensional phenomena. They take place in a continuum, and in principle the recording of a sound or image cannot be constrained in advance by a finite number of parameters. The true underlying mechanism is in many cases markedly non-Gaussian, and highly nonstationary.

• Difficulty of Characterization: There exists at the moment no reasonable “mechanical” way to characterize the structure of such complex phenomena. In the zero-mean Gaussian case, all behavior can be deduced from properties of the countable sequence of eigenvalues of the covariance kernel. Outside of the Gaussian case, very little is known about characterizing infinite-dimensional probability distributions which would be immediately helpful in modeling real-world phenomena such as images and sounds. Instead, we must live by our wits.

• Complexity of Optimization Problem: If we take Shannon literally, and apply the abstract $R(D)$ principle, determining the best way to code a naturally occurring source of data would require solving a mutual information problem involving probability distributions defined on an infinite-dimensional space. Unfortunately, it is not clear that one can obtain a clear intellectual description of such probability distributions in a form which would be manageable for actually stating the problem coherently, much less solving it.

In effect, uncovering the optimal codebook structure of naturally occurring data involves more challenging empirical questions than any that have ever been solved in empirical work in the mathematical sciences. Typical empirical questions that have been adequately solved in scientific work to date involve finding structure of very simple low-dimensional, well-constrained probability distributions.

The problem of determining the solution of the $R(D)$ problem, given a limited number of realizations of the process, could be considered a branch of what is becoming known in statistics as “functional data analysis”—the analysis of data when the observations are images, sounds, or other functions, and so naturally viewed as infinite-dimensional. Work in that field aims to determine structural properties of the probability distribution of functional data—for example, the covariance and/or its eigenfunctions, or the discriminant function for testing between two populations. Functional data analysis has shown that many challenging issues impede the extension of simple multivariate statistical methods to the functional case [79]. Certain simple multivariate procedures have been extended to the functional case: principal components, discriminant analysis, and canonical correlations being carefully studied examples. The problem that must be faced in such work is that one always has a finite number of realizations from which one is to infer aspects of the infinite-dimensional probabilistic generating mechanism. This is a kind of rank deficiency of the data set, which means that, for example, one cannot hope to get quantitatively accurate estimates of eigenfunctions of the covariance.

The mutual information optimization problem in Shannon’s $R(D)$, in the general non-Gaussian case, requires far more than just knowledge of a covariance or its eigenfunctions; it involves in principle all the joint distributional structure of the process. It is totally unclear how to deal with the issues that would crop up in such a generalization.

X. A ROLE FOR HARMONIC ANALYSIS

In this section, we comment on some interesting insights that harmonic analysis has to offer against the background of this “Grand Challenge.”

A. Terminology

The phrase “harmonic analysis” means many things to many different people. To some, it is associated with an abstract procedure in group theory—unitary group representations [54]; to others it is associated with classical mathematical physics—expansions in special functions related to certain differential operators; to others it is associated with “hard” analysis in its modern form [90].

The usual senses of the phrase all have roots in the bold assertions of Fourier that a) “any” function can be expanded in a series of sines and cosines, and that b) one could understand the complex operator of heat propagation by understanding merely its action on certain “elementary” initial distributions—namely, initial temperature distributions following sinusoidal profiles. As is now well known, making sense in one way or another of Fourier’s assertions has spawned an amazing array of concepts over the last two centuries; the theories of the Lebesgue integral, of Hilbert spaces, of $L^p$ spaces, of generalized functions, of differential equations, are all bound up in some way in the development, justification, and refinement of Fourier’s initial ideas. So no single writer can possibly mean “all of harmonic analysis” when using the term “harmonic analysis.”

For the purposes of this paper, harmonic analysis refers instead to a series of ideas that have evolved throughout this century, a set of ideas involving two streams of thought.

On the one hand, to develop ways to analyze functions using decompositions built through a geometrically motivated “cutting and pasting” of time, frequency, and related domains into “cells,” and the construction of systems of “atoms” “associated naturally” to those “cells.”

On the other hand, to use these decompositions to find characterizations, to within notions of equivalence, of interesting function classes.

It is perhaps surprising that one can combine these two ideas. A priori difficult-to-understand classes of functions in a functional space turn out to have a characterization as a superposition of “atoms” of a more or less concrete form. Of course, the functional spaces where this can be true are quite special; the miracle is that it can be done at all.

This is a body of work that has grown up slowly over the last ninety years, by patient accumulation of a set of tools and cultural attitudes little known to outsiders. As we stand at the end of the century, we can say that this body of work shows that there are many interesting questions about infinite-dimensional function classes where experience has shown that it is often difficult or impossible to obtain exact results, but where fruitful analogies do suggest similar problems which are “good enough” to enable approximate, or asymptotic, solutions.

In a brief survey paper, we can only superficially mention a few of the decomposition ideas that have been proposed and a few of the achievements and cultural attitudes that have resulted; we will do so in Sections XII, XVII, and XVIII below. The reader will find that [58] provides a wealth of helpful background material complementing the present paper.

B. Relevance to the Grand Challenge

Much of harmonic analysis in this century can be characterized as carrying out a three-part program:

HA1 Identify an interesting class of mathematically defined objects (functions, operators, etc.).

HA2 Develop tools to characterize the class of objects in terms of functionals derivable from an analysis of the objects themselves.

HA3 Improve the characterization tools themselves, refining and streamlining them.

This program, while perhaps obscure to outsiders, has borne interesting fruit. As we will describe below, wavelet transforms arose from a long series of investigations into the structure of classes of functions defined by $L^p$ constraints. The original question was to characterize function classes by analytic means, such as by the properties of the coefficients of the considered functions in an orthogonal expansion. It was found that the obvious first choice—Fourier coefficients—could not offer such a characterization when $p \neq 2$, and eventually, after a long series of alternate characterizations were discovered, it was proved that wavelet expansions offered such characterizations—i.e., that by looking at the wavelet coefficients of a function, one could learn the $L^p$-norm to within constants of equivalence, for $1 < p < \infty$.

It was also learned that for $p = 1$ and $p = \infty$ no norm characterization was possible, but after replacing $L^1$ by the closely related space $H^1$ and $L^\infty$ by the closely related space $BMO$ (the space of functions of Bounded Mean Oscillation), the wavelet coefficients again contained the required information for knowing the norm to within constants of equivalence.

The three parts HA1–HA3 of the harmonic analysis program are entirely analogous to the three steps GC1–GC3 in the Grand Challenge for data compression—except that for data compression, the challenge is to deal with naturally occurring data sources, while for harmonic analysis the challenge is to deal with mathematically interesting classes of objects.

It is very striking to us that the natural development of harmonic analysis in this century, while in intention completely unrelated to problems of data compression, has borne fruit which seems very relevant to the needs of the data compression community. Typical byproducts of this effort so far include the fast wavelet transform, and lossy wavelet-domain coders which exploit the “tree” organization of the wavelet transform. Furthermore, we are aware of many other developments in harmonic analysis which have not yet borne fruit of direct impact on data compression, but seem likely to have an impact in the future.

There are other insights available from developments in harmonic analysis. In the comparison between the three-part challenge facing data compression and the three-part program of harmonic analysis, the messiness of understanding a natural data source—which requires dealing with specific phenomena in all their particularity—is replaced by the precision of understanding a class with a formal mathematical definition. Thus harmonic analysis operates in a more ideal setting for making intellectual progress; but sometimes progress is not as complete as one would like. It is accepted by now that many characterization problems of function classes cannot be exactly solved. Harmonic analysis has shown that often one can make substantial progress by replacing hard characterization problems with less demanding problems where answers can be obtained explicitly. It has also shown that such cruder approximations are still quite useful and important.

A typical example is the study of operators. The eigenfunctions of an operator are fragile, and can change radically if the operator is only slightly perturbed in certain ways. It is in general difficult to get explicit representations of the eigenfunctions, and to compute them. Harmonic analysis, however, shows that for certain problems, we can work with “almost eigenfunctions” that “almost diagonalize” operators. For example, wavelets work well on a broad range of operators, and moreover, they lend themselves to concrete fast computational algorithms. That is, the exact problem, which is potentially intractable, is replaced by an approximate problem for which computational solutions exist.


C. Survey of the Field

In the remainder of this paper we will discuss a set of developments in the harmonic analysis community and how they can be connected to results about data compression. We think it will become clear to the reader that in fact the lessons that have been learned from the activity of harmonic analysis are very relevant to facing the Grand Challenge for data compression.

XI. NORM EQUIVALENCE PROBLEMS

The problem of characterizing a class of functions $\{f : \|f\| \le C\}$, where $\|\cdot\|$ is a norm of interest, has occupied a great deal of the attention of harmonic analysts in this century. The basic idea is to relate the norm, defined in say continuum form by an integral, to an equivalent norm defined in discrete form:

$$\|f\| \asymp \|\theta(f)\|_{*}. \quad (11.1)$$

Here $\theta(f) = (\theta_i)$ denotes a collection of coefficients arising in a decomposition of $f$; for example, these coefficients would be obtained by
$$\theta_i = \langle f, \varphi_i \rangle$$

if the $(\varphi_i)$ made up an orthonormal basis. The norm $\|\cdot\|$ is a norm defined on the continuum object $f$, and the norm $\|\cdot\|_{*}$ is a norm defined on the corresponding discrete object $\theta(f)$. The equivalence symbol $\asymp$ in (11.1) means that there are constants $c_1$ and $c_2$, not depending on $f$, so that

$$c_1\,\|f\| \le \|\theta(f)\|_{*} \le c_2\,\|f\|. \quad (11.2)$$

The significance is that the coefficients contain within them the information needed to approximately infer the size of $f$ in the norm $\|\cdot\|$. One would of course usually prefer to have $c_1 = c_2 = 1$, in which case the coefficients characterize the size of $f$ precisely, but often this kind of tight characterization is beyond reach.

The most well-known and also tightest form of such a relationship is the Parseval formula, valid for the continuum norm of $L^2$. If the $(\varphi_i)$ constitute a complete orthonormal system for $L^2$, then we have
$$\|f\|_{L^2}^2 = \sum_i |\theta_i|^2. \quad (11.3)$$

Another frequently encountered relationship of this kind is valid for functions on the circle $T = [0, 2\pi)$, with the Fourier basis of Section II. Then we have a norm equivalence for a norm defined on the $m$th derivative:
$$\|f^{(m)}\|_{L^2}^2 \asymp \sum_k |k|^{2m}\, |\theta_k|^2. \quad (11.4)$$

These two equivalences are, of course, widely used in the mathematical sciences. They are beautiful, but also potentially misleading. A naive reading of these results might promote the expectation that one can frequently have tightness ($c_1 = c_2 = 1$) in characterization results, or that the Fourier basis works in other settings as well.

We mention now five kinds of norms for which we might like to solve norm equivalence, and for which the answers to all these norm-equivalence problems are by now well understood.

• $L^p$-norms: Let $1 \le p < \infty$. The $L^p$-norm is, as usual, just
$$\|f\|_p = \Big( \int |f(t)|^p\, dt \Big)^{1/p}.$$
We can extend this scale of norms to $p = \infty$ by taking $\|f\|_\infty = \operatorname{ess\,sup} |f|$.

• Sobolev norms: Let $m$ be a positive integer. The $L^2$-Sobolev norm is
$$\|f\|_{W_2^m} = \big( \|f\|_2^2 + \|f^{(m)}\|_2^2 \big)^{1/2}.$$

• Hölder classes: Let $0 < \alpha < 1$. The Hölder class $C^\alpha$ is the collection of continuous functions $f$ on the domain $[0,1]$ with $\|f\|_\infty \le C$ and
$$|f(t) - f(s)| \le C\,|t - s|^\alpha$$
for some $C$; the smallest such $C$ is the norm. Let $\alpha = m + \beta$, for integer $m$ and $0 < \beta < 1$; the Hölder class $C^\alpha$ is then the collection of $m$-times continuously differentiable functions $f$ on the domain $[0,1]$ with
$$|f^{(m)}(t) - f^{(m)}(s)| \le C\,|t - s|^{\beta}$$
for some $C$.

• Bump Algebra: Suppose that $f$ is a continuous function on the line $\mathbb{R}$. Let $g(t) = e^{-t^2/2}$ be a Gaussian normalized to height one rather than area one. Suppose that $f$ can be represented as
$$f(t) = \sum_i a_i\, g\big( (t - t_i)/s_i \big)$$
for a countable sequence of triples $(a_i, t_i, s_i)$ with $s_i > 0$ and $\sum_i |a_i| < \infty$. Then $f$ is said to belong to the Bump Algebra, and its Bump norm is the smallest value of $\sum_i |a_i|$ for which such a decomposition exists. Evidently, a function in the Bump Algebra is a superposition of Gaussians, with various polarities, locations, and scales.

• Bounded Variation: Suppose that $f$ is a function on the interval $[0,1]$ that is integrable and such that the increment obeys
$$\int |f(t+h) - f(t)|\, dt \le C\,|h|$$
for all $h$. The BV seminorm of $f$ is the smallest $C$ for which this is true.

In each case, the norm equivalence problem is: find an orthobasis $(\varphi_i)$ and a discrete norm $\|\cdot\|_{*}$ so that the continuum norm is equivalent to the discrete norm of the coefficient sequence. In-depth discussions of these spaces and their norm-equivalence problems can be found in [103], [89], [90], [96], [43], and [74]. In some cases, as we explain in Section XII below, the norm equivalence has been solved; in other cases, it has been proven that there can never be a norm equivalence, but a closely related space has been discovered for which a norm equivalence is available.

In these five problems, Fourier analysis does not work; that is, one cannot find a “simple” and “natural” norm on the Fourier coefficients which provides an equivalent norm to the considered continuous norm. It is also true that tight equivalence results, with $c_1 = c_2 = 1$, seem out of reach in these cases.

The key point in seeking a norm equivalence is that the discrete norm must be “simple” and “natural.” By this we really mean that the discrete norm should depend only on the sizes of the coefficients and not on the signs or phases of the coefficients. We will say that the discrete norm is unconditional if it obeys the relationship
$$\|\theta'\|_{*} \le \|\theta\|_{*}$$
whenever $\theta'_i = w_i\,\theta_i$ with any sequence of weights $|w_i| \le 1$. The idea is that “shrinking the coefficients in size” should “shrink” the norm.

A norm equivalence result therefore requires discovering both a representing system $(\varphi_i)$ and a special norm on a sequence space, one which is equivalent to the considered continuum norm and also has the unconditionality property. For future reference, we call a basis yielding a norm equivalence between a norm on function space and such an unconditional norm on sequence space an unconditional basis. It has the property that for any object $f$ in a function class $\mathcal{F}$, and any set of coefficients $(\theta'_i)$ obeying $|\theta'_i| \le |\theta_i(f)|$, the newly constructed object
$$f' = \sum_i \theta'_i\, \varphi_i$$
belongs to $\mathcal{F}$ as well.

There is a famous result dramatizing the fact that the Fourier basis is not an unconditional basis for classes of continuous functions, due to DeLeeuw, Kahane, and Katznelson [25]. Their results allow us to construct pairs of functions: one, $f$, say, which is uniformly continuous on the circle; and the other, $g$, say, very wild, having square-integrable singularities on a dense subset of the circle. The respective Fourier coefficients obey
$$|\hat g(k)| \le |\hat f(k)| \qquad \text{for all } k.$$
In short, the ugly and bizarre object $g$ has the smaller Fourier coefficients. Another way of putting this is that an extremely delicate pattern in the phases of the coefficients, rather than the size of the coefficients, controls the regularity of the function. The fact that special “conditions” on the coefficients, unrelated to their size, are needed to impose regularity may help to explain the term “unconditional” and the preference for unconditional structure.

XII. H ARMONIC ANALYSIS AND NORM EQUIVALENCE

We now describe a variety of tools that were developed in harmonic analysis over the years in order to understand norm equivalence problems, and some of the norm equivalence problems that were solved.

A. Warmup: The Sampling Theorem

A standard procedure by which interesting orthobases have been constructed by harmonic analysts is to first develop a kind of “overcomplete” continuous representation, and later develop a discretized variant based on a geometric model.

An elementary example of this procedure will be familiar to readers in the information theory community, as Shannon’s Sampling Theorem [84] in signal analysis. Let $\hat f$ be an $L^2$ function supported in a finite frequency interval $[-\pi W, \pi W]$, and let $f(t)$ be the time-domain representation
$$f(t) = \frac{1}{2\pi} \int \hat f(\omega)\, e^{i\omega t}\, d\omega.$$
This representation is an $L^2$-isometry, so that
$$\int |f(t)|^2\, dt = \frac{1}{2\pi} \int |\hat f(\omega)|^2\, d\omega.$$
The sizes of the function on the frequency side and on the time side are the same, up to normalization. By our assumptions, the time-domain representation is a bandlimited function, which is very smooth, and so the representation of $f$ in the time domain is very redundant. A nonredundant representation is obtained by sampling, and retaining only the samples $(f(n/W))_{n \in \mathbb{Z}}$. The mathematical expression of the perfect nonredundancy of this representation is the fact that we have the norm equivalence
$$\|f\|_{L^2}^2 = \frac{1}{W} \sum_n |f(n/W)|^2, \quad (12.1)$$
and that there is an orthobasis of sampling functions $(\varphi_n)$—translated sinc functions—so that $\theta_n = \langle f, \varphi_n \rangle = W^{-1/2}\, f(n/W)$ and $f = \sum_n \theta_n\, \varphi_n$.

This time-domain representation has the following geometric interpretation: there is a sequence of disjoint “cells” of length $1/W$, indexed by $n$; the samples $f(n/W)$ summarize the behavior in those cells; and the sampling functions provide the details of that behavior. While this circle of ideas was known to Shannon and was very influential for signal analysis, we should point out that harmonic analysts developed ideas such as this somewhat earlier, under more general conditions, and based on an exploration of a geometric model explaining the phenomenon. For example, work of Paley and Wiener in the early 1930’s and of Plancherel and Polya in the mid 1930’s concerned norm equivalence in the more general case when the points of sampling were not equispaced, and obtained methods giving equivalence for all $L^p$ norms,
$$\|f\|_{L^p}^p \asymp \frac{1}{W} \sum_n |f(t_n)|^p, \quad (12.2)$$
provided the points $(t_n)$ are approximately equispaced at density $W$.

The results that the harmonic analysts obtained can be interpreted as saying that the geometric model of the sampling theorem has a very wide range of validity. For this geometric model, we define a collection of “cells” $(C_n)$, namely, intervals of length $1/W$ centered at the sampling points, and construct the piecewise-constant object
$$P(t) = \sum_n f(t_n)\, \mathbf{1}_{C_n}(t).$$
The norm equivalence (12.2) says, in fact, that
$$\|f\|_{L^p} \asymp \|P\|_{L^p} \quad (12.3)$$
for a wide range of $p$.

This pattern—continuous representation, discretization, geometric model—has been of tremendous significance in harmonic analysis.
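A numerical rendering of (12.1) (a sketch of ours; the bandwidth, sample values, and integration grid are arbitrary, and the grid must be wide because sinc tails decay slowly):

```python
import numpy as np

W = 8.0
rng = np.random.default_rng(5)
a = rng.standard_normal(17)                    # the samples f(n/W), n = 0..16

t = np.linspace(-60.0, 62.0, 1_000_001)        # wide, fine grid for quadrature
f = sum(an * np.sinc(W * t - n) for n, an in enumerate(a))  # bandlimited f

lhs = np.trapz(f ** 2, t)                      # ||f||_2^2 by quadrature
rhs = np.sum(a ** 2) / W                       # (1/W) sum |f(n/W)|^2, as in (12.1)
print(lhs, rhs)                                # agree to a few decimal places
```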

B. Continuous Time-Scale Representations

At the beginning of this century, a number of interesting relationships between Fourier series and harmonic function theory were discovered, which showed that valuable information about a function $f$ defined on the circle or the line can be garnered from its harmonic extension into the interior of the circle, respectively the upper half-plane; in other words, by viewing $f$ as the boundary values of a harmonic function $u$.

This theory also benefits from an interplay with complex analysis, since a harmonic function $u$ is the real part of an analytic function $F = u + iv$. The imaginary part $v$ is called the conjugate function of $u$. The harmonic function $u$ and the associated analytic function $F$ give much important information about $f$. We want to point out how capturing this information on $f$ through the functions $u$ and $F$ ultimately leads to favorable decompositions of $f$ into fundamental building blocks called “atoms.” These decompositions can be viewed as a precursor to wavelet decompositions. For expository reasons we streamline the story and follow [89, Chs. III, IV], [43, Ch. 1], and [45]. Let $f$ be any “reasonable” function on the line. It has a harmonic extension into the upper half-plane given by the Poisson integral
$$u(x, y) = \int P_y(x - t)\, f(t)\, dt, \quad (12.4)$$

where $P_y(x) = \frac{1}{\pi}\frac{y}{x^2 + y^2}$ is the Poisson kernel. This associates to a function $f$ of one variable the harmonic function $u$ of two variables, where the argument $x$ again ranges over the line and $y$ ranges over the positive reals. The physical interpretation of this integral is that $f$ is the boundary value of $u$, and $u(\cdot, y)$ is what can be sensed at some “depth” $y$. Each of the functions $u(\cdot, y)$ is infinitely differentiable. Whenever $f \in L^p$, the equal-depth sections $u(\cdot, y)$ converge to $f$ as $y \to 0$ and, therefore, the norms converge as well: $\|u(\cdot, y)\|_p \to \|f\|_p$. We shall see next another way to capture $\|f\|$ through the function $u$. Hardy in 1914 developed an identity in the closely related setting where one has a function defined on the unit circle, and one uses the harmonic extension into the interior of the unit disk [53]; he noticed a way to recover the norm of the boundary values from the norm of the values on the interior of the disk. By conformal mapping and some simple identities based on Green’s theorem, this is recognized today as equivalent to the statement that in the setting (12.4) the $L^2$ norm of the “boundary values” $f$ can be recovered from the behavior of the whole function $u$:
$$\|f\|_{L^2}^2 = c \iint_{y > 0} \Big| \frac{\partial u}{\partial y}(x, y) \Big|^2\, y\, dx\, dy. \quad (12.5)$$
Defining $V(x, y) = y\,\frac{\partial u}{\partial y}(x, y)$, this formula says
$$\|f\|_{L^2}^2 = c \iint_{y > 0} |V(x, y)|^2\, \frac{dx\, dy}{y}. \quad (12.6)$$

Now the object $V$ is itself an integral transform,
$$V(x, y) = \int K_y(x - t)\, f(t)\, dt,$$
where the kernel $K_y(x) = y\,\frac{\partial}{\partial y} P_y(x)$ results from differentiating the Poisson kernel. This gives a method of assigning, to a function of a single real variable, a function of two variables, in an $L^2$-norm-preserving way. In addition, this association is invertible since, formally,
$$f = c \int_0^\infty V(\cdot, y)\, \frac{dy}{y}$$
when $f$ is a “nice” function. In the 1930’s, Littlewood and Paley, working again in the closely related setting of functions defined on the circle, found a way to obtain information on the $L^p$ norms of a function defined on the circle, from an appropriate analysis of its values inside the circle. This is now understood as implying that $V$ contains not just information about the $L^2$ norm as in (12.6), but also on $L^p$ norms, $1 < p < \infty$. Defining
$$(Gf)(x) = \Big( \int_0^\infty |V(x, y)|^2\, \frac{dy}{y} \Big)^{1/2},$$
Littlewood–Paley theory in its modern formulation says that
$$\|Gf\|_{L^p} \asymp \|f\|_{L^p} \quad (12.7)$$

for $1 < p < \infty$. The equivalence (12.7) breaks down when $p = 1$. (We shall not discuss the other problem point, $p = \infty$, here, since those spaces are not separable and therefore do not have the series representations we seek.) To understand the reason for this breakdown one needs to examine the conjugate function of $f$ mentioned above. It enjoys the same properties provided $1 < p < \infty$: it has boundary values $\tilde f$, and the function $\tilde f$ (called the conjugate function of $f$) is also in $L^p$. But the story takes a turn for the worse when $p = 1$: for a function $f \in L^1$, its conjugate function need not be in $L^1$. The theory of the real Hardy spaces $H^p$ is a way to repair the situation and better understand the norm equivalences (12.7). A function $f$ is said to be in real $H^1$ if and only if both $f$ and its conjugate function $\tilde f$ are in $L^1$; the norm of $f$ in $H^1$ is the sum of the $L^1$ norms of $f$ and $\tilde f$. Replacing $L^1$ by $H^1$ on the left side of (12.7), we obtain equivalences with absolute constants that hold even for $p = 1$. In summary, the spaces $H^p$ are a natural replacement for the $L^p$ when discussing representations and norm equivalences.

In modern parlance, the object $V(x, y)$ would be called an instance of the continuous wavelet transform. This terminology was used for the first time in the 1980’s by Grossmann and Morlet [52], who proposed the study of the integral transform
$$(T_\psi f)(a, b) = \int f(t)\, \psi_{a,b}(t)\, dt, \quad (12.8)$$


where
$$\psi_{a,b}(t) = a^{-1/2}\, \psi\Big( \frac{t - b}{a} \Big)$$
and $\psi$ is a so-called “wavelet,” which must be chosen to obey an admissibility condition; for convenience, we shall require here a special form of this condition:
$$\int_0^\infty |\hat\psi(\omega)|^2\, \frac{d\omega}{\omega} = 1. \quad (12.9)$$

Here $b$ is a location parameter and $a$ is a scale parameter, and the transform maps $f$ into a time-scale domain. (Under a different terminology, a transform and admissibility condition of the same type can also be found in [2].) The wavelet transform with respect to an admissible wavelet is invertible,
$$f(t) = \iint (T_\psi f)(a, b)\, \psi_{a,b}(t)\, \frac{da\, db}{a^2}, \quad (12.10)$$
and also an isometry,
$$\iint |(T_\psi f)(a, b)|^2\, \frac{da\, db}{a^2} = \|f\|_{L^2}^2. \quad (12.11)$$
One sees by simple comparison of terms that a choice of $\psi$ proportional to the differentiated Poisson kernel yields $T_\psi f = c\,V$ under the calibration $a = y$ and $b = x$. We thus gain a new interpretation of $V$: the function $\psi_{a,b}$ has “scale” $a$, and so we can interpret $V$ as providing an association of the object $f$ with a certain “time-scale” portrait.

The continuous wavelet transform is highly redundant, as the particular instance $V$ shows: it is harmonic in the upper half-plane. Insights into the redundancy of the continuous wavelet transform are provided (for example) in [23]. Suppose we associate to the point $(a, b)$ the rectangle $R(a, b) = [b - a/2,\, b + a/2] \times [a/2,\, a]$; then the information “near $(a, b)$” in the wavelet transform is weakly related to the information near $(a', b')$ if the corresponding rectangles are well-separated.

C. Atomic Decomposition

As pointed out earlier, it is natural to replace the study of the spaces $L^p$ with that of the spaces $H^p$; in particular, we avoid certain unsatisfactory aspects of $L^1$, which does not have any unconditional basis, and which behaves quite unlike its logically neighboring spaces. The norm equivalence (12.7), which does not work at $p = 1$, is then replaced by
$$\|Gf\|_{L^p} \asymp \|f\|_{H^p},$$
valid for all $0 < p < \infty$.

In the late 1960’s and early 1970’s, one of the most successful areas of research in harmonic analysis concerned $H^p$ and associated spaces. At that time, the concept of “atomic decomposition” arose; the key point was the discovery by Fefferman that one can characterize membership in the space $H^p$ precisely by the properties of an atomic decomposition
$$f = \sum_i a_i$$
into atoms $a_i$ obeying various size, oscillation, and support constraints. Since then, “atomic decomposition” has come to mean a decomposition of a function into a discrete superposition of nonrigidly specified pieces, where the pieces obey various analytic constraints (size, support, smoothness, moments) determined by a space of interest, with the size properties of those pieces providing the characterization of the norm of the function [16], [29], [42], [43].

The continuous wavelet transform gives a natural tool with which to build atomic decompositions for various spaces. Actually, the tool predates the wavelet transform, since already in the 1960’s Calderón [9] established a general decomposition, a “resolution of the identity operator,” which in wavelet terms can be written
$$\mathrm{Id} = \iint \langle \cdot\,, \psi_{a,b} \rangle\, \psi_{a,b}\, \frac{da\, db}{a^2}.$$
The goal is not just to write the identity operator in a more complicated form; by decomposing the integration domain using a partition into disjoint sets, one obtains a family of nontrivial operators, corresponding to different time-scale regions, which sum to the identity operator.

Let $I$ now denote a dyadic interval, i.e., an interval of the form $I = [k\,2^{-j}, (k+1)\,2^{-j})$ with $k$ and $j$ integers, and let $R(I)$ denote a time-scale rectangle sitting “above” $I$:
$$R(I) = I \times [2^{-j}, 2^{-j+1}).$$
The collection of all dyadic intervals is obtained as the scale index $j$ runs through the integers (both positive and negative) and the position index $k$ runs through the integers as well. The corresponding collection of all rectangles $R(I)$ forms a disjoint cover of the whole upper half-plane; compare Fig. 4. If we now take the Calderón reproducing formula and partition the range of integration using this family of rectangles, we have formally that $\mathrm{Id} = \sum_I A_I$, where
$$A_I f = \iint_{R(I)} \langle f, \psi_{a,b} \rangle\, \psi_{a,b}\, \frac{da\, db}{a^2}.$$

Here $A_I$ is an operator formally associating to $f$ that “piece” coming from the time-scale region $R(I)$. In fact, it makes sense to call $a_I = A_I f$ a time-scale atom [43]. The region $R(I)$ of the wavelet transform, owing to the redundancy properties, constitutes in some sense a minimal coherent piece of the wavelet transform. The corresponding atom summarizes the contributions of this coherent region to the reconstruction, and has properties one would consider natural for such a summary. If $\psi$ is compactly supported in a suitable interval, then $a_I$ will be supported in $3I$, the interval with the same center as $I$ but three times its width. Also, if $\psi$ is smooth and admissible, then $a_I$ will be smooth, oscillating only as much as its association with the cell $R(I)$ requires. For example, if $\psi$ is $m$-times differentiable, and the wavelet is chosen appropriately,
$$\|a_I^{(m)}\|_2 \le C\,|I|^{-m}\,\|a_I\|_2, \quad (12.12)$$
and we cannot expect a better dependence of the properties of an atom on $I$ in general. So the formula $\mathrm{Id} = \sum_I A_I$ decomposes $f$ into a countable sequence of time-scale atoms.


Fig. 4. A tiling of the time-scale plane by rectangles.

This analysis into atoms is informative about the properties of $f$; for example, suppose that $f$ “is built from many large atoms at fine scales”; then, from (12.12), $f$ cannot be very smooth. As shown in [43], one can express a variety of function-space norms in terms of the size properties of the atoms; or, more properly, of the continuous wavelet transform associated to the “cells” $R(I)$. For example, if our measure of the “size” of the atom $a_I$ is the “energy” of the “associated time-scale cell,”
$$s_I = \Big( \iint_{R(I)} |(T_\psi f)(a, b)|^2\, \frac{da\, db}{a^2} \Big)^{1/2},$$
then
$$\|f\|_{L^2}^2 = \sum_I s_I^2.$$

D. Unconditional Basis

By the end of the 1970’s, atomic decomposition methods gave a satisfactory understanding of $H^p$ spaces; but in the early 1980’s, the direction of research turned toward establishing a basis for these spaces. B. Maurey had shown by abstract analysis that an unconditional basis of $H^1$ must exist, and Carleson had shown a way to construct such a basis by, in a sense, repairing the lack of smoothness of the Haar basis.

Let $h(t) = \mathbf{1}_{[0,1/2)}(t) - \mathbf{1}_{[1/2,1)}(t)$ denote the Haar function; associate a scaled and translated version of $h$ to each dyadic interval $I = [k\,2^{-j}, (k+1)\,2^{-j})$ according to $h_I(t) = 2^{j/2}\, h(2^j t - k)$. This collection of functions makes a complete orthonormal system for $L^2$. In particular, $f = \sum_I \langle f, h_I \rangle\, h_I$. This is similar in some ways to the atomic decomposition idea of the last section—it is indexed by dyadic intervals and associates a decomposition to a family of dyadic intervals. However, it is better in other ways, since the $h_I$ are fixed functions rather than atoms. Unfortunately, while it gives an unconditional basis for $L^p$, $1 < p < \infty$, it does not give an unconditional basis for most of the spaces where Littlewood–Paley theory applies; the discontinuities in the $h_I$ are problematic.

Carleson’s attempt at “repairing the Haar basis” stimulated Strömberg (1982), who succeeded in constructing an orthogonal unconditional basis for the $H^p$ spaces. In effect, he showed how, for any desired degree of smoothness, to construct a function $\psi$ so that the functions $\psi_{j,k}(t) = 2^{j/2}\,\psi(2^j t - k)$ were orthogonal and constituted an unconditional basis for these spaces. Today we would call such a $\psi$ a wavelet; in fact, Strömberg’s $\psi$ is a spline wavelet—a piecewise polynomial—and constituted the first orthonormal wavelet basis with smooth elements. In [92], Strömberg testifies that at the time that he developed this basis he was not aware of, or interested in, applications outside of harmonic analysis. Strömberg’s work was published in a conference proceedings volume and was not widely noticed at the time.

In the mid-1980’s, Yves Meyer and collaborators became very interested in the continuous wavelet transform as developed by Grossmann and Morlet. They first built frame expansions, which are more discrete than the continuous wavelet transform and more rigid than the atomic decomposition. Then, Meyer developed a bandlimited function—the Meyer wavelet—which generated an orthonormal basis for $L^2(\mathbb{R})$ having elements which were infinitely differentiable and decayed rapidly at $\pm\infty$. Lemarié and Meyer [64] then showed that this offered an unconditional basis for a very wide range of spaces: all the $L^p$-Sobolev spaces, $1 < p < \infty$, of all orders; all the Hölder classes; and more generally, all Besov spaces. Frazier and Jawerth have shown that the Meyer basis offers unconditional bases for all the spaces in the Triebel scale; see [43].


E. Equivalent Norms

The solution of these norm-equivalence problems required the introduction of two new families of norms on sequence space; the general picture is due to Frazier and Jawerth [42], [43].

The first family is the (homogeneous) Besov sequence norms. With $\theta_{j,k}$ a shorthand for the coefficient arising from the dyadic interval $I = [k\,2^{-j}, (k+1)\,2^{-j})$,
$$\|\theta\|_{b_{p,q}^{s}} = \Bigg( \sum_j \Big[ 2^{j(s + 1/2 - 1/p)} \Big( \sum_k |\theta_{j,k}|^p \Big)^{1/p} \Big]^q \Bigg)^{1/q},$$
with an obvious modification if either $p = \infty$ or $q = \infty$. These norms first summarize the sizes of the coefficients at all positions within a single scale, using an $\ell^p$ norm, and then summarize across scales, with an exponential weighting.

The second family involves the (homogeneous) Triebel sequence norms. With $\mathbf{1}_I$ the indicator of $I$,
$$\|\theta\|_{f_{p,q}^{s}} = \Bigg\| \Big( \sum_I \big( |I|^{-s - 1/2}\, |\theta_I|\, \mathbf{1}_I(\cdot) \big)^q \Big)^{1/q} \Bigg\|_{L^p},$$
with an obvious modification if $q = \infty$, and a special modification if $p = \infty$ (which we do not describe here; this has to do with the space BMO). These norms first summarize the sizes of the coefficients across scales, and then summarize across positions.

The reader should note that the sequence-space norm expressions have the unconditionality property: they involve only the sizes of the coefficients. If one shrinks the coefficients in such a way that they become termwise smaller in absolute value, then these norms must decrease.

We give some brief examples of norm equivalences using these families. Recall the list of norm equivalence problems in Section XI. Our first two examples use the Besov sequence norms.

• (Homogeneous) Hölder Space $\dot C^\alpha$: For $0 < \alpha < 1$, the norm is the smallest constant $C$ so that
$$|f(t) - f(s)| \le C\,|t - s|^\alpha.$$
To get an equivalent norm, use an orthobasis built from (say) Meyer wavelets, and measure the norm of the wavelet coefficients by the Besov norm with $s = \alpha$ and $p = q = \infty$. This reduces to a very simple expression, namely,
$$\|\theta\| = \sup_{j,k}\; 2^{j(\alpha + 1/2)}\, |\theta_{j,k}|. \quad (12.13)$$
In short, membership in $\dot C^\alpha$ requires that the wavelet coefficients all decay like an appropriate power of the associated scale: $|\theta_{j,k}| \le C\, 2^{-j(\alpha + 1/2)}$.

• Bump Algebra: Use an orthobasis built from the Meyer wavelets, and measure the norm of the wavelet coefficients by the Besov norm with $s = 1$ and $p = q = 1$. This reduces to a very simple expression, namely,
$$\|\theta\| = \sum_j 2^{j/2} \sum_k |\theta_{j,k}|. \quad (12.14)$$
Membership in the Bump Algebra thus requires that the sums of the wavelet coefficients decay like an appropriate power of scale. Note, however, that some coefficients could be large, at the expense of others being correspondingly small in order to preserve the size of the sum.

We can also give examples using the Triebel sequence norms.

• $L^p$: For $1 < p < \infty$, use an orthobasis built from, e.g., the Meyer wavelets, and measure the norm of the wavelet coefficients by the Triebel sequence norm with $s = 0$, $q = 2$, and $p$ precisely the same as the $p$ in $L^p$.

• $L^p$-Sobolev spaces $W_p^m$: For $1 < p < \infty$, use an orthobasis built from, e.g., the Meyer wavelets, and measure the norm of the wavelet coefficients by a superposition of two Triebel sequence norms: one with $s = 0$, $q = 2$, and $p$ precisely the same as the $p$ in $W_p^m$; the other with $s = m$, $q = 2$, and $p$ precisely the same as the $p$ in $W_p^m$.

F. Norm Equivalence as a Sampling Theorem

The Besov and Triebel sequence norms have an initially opaque appearance. A solid understanding of their structure comes from a view of norm equivalence as establishing a sampling theorem for the upper half-plane, in a manner reminiscent of the Shannon sampling theorem and its elaborations (12.1) and (12.2).

Recall the Littlewood–Paley theory of the upper half-plane of Section XII-A and the use of dyadic rectangles built "above" dyadic intervals from Section XII-B. Partition the upper half-plane according to the family of rectangles. Given a collection of wavelet coefficients, assign to each rectangle the value of the corresponding coefficient. One obtains in this way a piecewise-constant function on the half-plane.

This is a kind of caricature of the Poisson integral. Using this function as if it were a true Poisson integral suggests calculating, for the relevant indices, the corresponding continuum expressions.

As it turns out, the Triebel sequence norm is precisely a simple continuum norm of this object.

In short, the Triebel sequence norm expresses the geometric analogy that the piecewise-constant object may be treated as if it were a Poisson integral. Why is this reasonable?

Observe that each wavelet coefficient is precisely a sample of the continuous wavelet transform, taken with respect to the wavelet generating the orthobasis. The sampling point is the point whose coordinates are those of the lower left corner of the


rectangle in question.

That is, the piecewise-constant object is a pseudo-continuous wavelet transform, obtained by replacing the true continuous wavelet transform on each cell by a cell-wise constant function taking, on each cell, the transform's value at the lower left corner of that cell.

The equivalence of the true norm with the Triebel sequence norm of the wavelet coefficients expresses the fact that the piecewise-constant function, built from time-scale samples of the continuous wavelet transform, has a norm—defined on the whole time-scale plane—which is equivalent to the original norm. Indeed, define a norm on the time-scale plane by summarizing first across scales and then across positions; the Triebel sequence norm is exactly this continuum norm applied to the piecewise-constant object. The equivalence of norms

can be broken into the following stages:

the first of which follows from Littlewood–Paley theory; the second of which says that the "Poisson wavelet" and some other nice wavelet obtain equivalent information; and the third of which compares the continuous transform with its time-scale samples.

This is a sampling theorem for the upper half-plane, showing that an object and its piecewise-constant approximation

have equivalent norms. It is exactly analogous to the equivalence (12.3) that we discussed in the context of the Shannon sampling theorem. Similar interpretations can be given for the Besov sequence norm: it is precisely a simple continuum norm of the corresponding piecewise-constant object. The difference is that the Besov norm involves first a summarization in position and then a summarization in scale; this is the opposite order from the Triebel case.

XIII. NORM EQUIVALENCE AND AXES OF ORTHOSYMMETRY

There is a striking geometric significance to the unconditional basis property.

Consider a classical example using the exact norm-equivalence properties (11.3) and (11.4) from Fourier analysis. Suppose we consider the class consisting of all functions obeying a quadratic constraint on the function and its derivatives, with a fixed bound. This is a body in infinite-dimensional space; defined as it is by quadratic constraints, we call it an ellipsoid. Owing to the exact norm-equivalence properties (11.3) and (11.4), the axes of symmetry of the ellipsoid are precisely the sinusoids.

Something similar occurs in the nonclassical cases. Consider, for example, the Hölder norm equivalence (12.13). This says that, up to an equivalent re-norming, the Hölder class is a kind of hyperrectangle in infinite-dimensional space. This hyperrectangle has axes of symmetry; the directions of these axes are given by the members of the wavelet basis.

Consider now the Bump Algebra norm equivalence (12.14). This says that, up to an equivalent re-norming, the Bump Algebra is a kind of octahedron in infinite-dimensional space. This octahedron has axes of symmetry, and these are again given by the members of the wavelet basis.

So the unconditional basis property means that the basis functions serve as axes of symmetry for the corresponding function ball. This is analogous to the existence of axes of symmetry for an ellipsoid, but is more general: it applies in the case of function balls which are not ellipsoidal, i.e., not defined by quadratic constraints.

There is another way to put this that might also be useful. The axes of orthosymmetry of an ellipsoid can be derived as eigenfunctions of the quadratic form defining the ellipsoid. The axes of orthosymmetry solve the problem of "rotating the space" into a frame where the quadratic form becomes diagonal. In the more general setting, where the norm balls are not ellipsoidal, we can say that an unconditional basis solves the problem of "rotating the space" into a frame where the norm, although not involving a quadratic functional, is "diagonalized."

XIV. BEST ORTHOBASIS FOR NONLINEAR APPROXIMATION

The unconditional basis property has important implications for schemes of nonlinear approximation which use the best n terms in an orthonormal basis. In effect, the unconditional basis of a class will be, in a certain asymptotic sense, the best orthonormal basis for n-term approximation of members of the associated function ball. We highlight these results and refer the reader to [26] and [30] for more details on nonlinear approximation.

A. n-Term Approximations: Linear and Nonlinear

Suppose one is equipped with an orthobasis, and one wishes to approximate a function using n-term expansions. If the retained terms are fixed—for example, as the first n basis elements in the assumed ordering—this is a problem of linear approximation, which can be solved (owing to the assumed orthogonality of the basis) by taking the corresponding inner-product coefficients. Supposing that the ordering is an enumeration of the integers, the approximation error in such an orthogonal system is the energy of the omitted coefficients, which means that the error is small if the important coefficients occur in the leading n terms rather than in the tail of the sequence.


Fig. 5. Two objects, and performance of best-n and first-n approximations in both Fourier and wavelet bases, and linear and nonlinear approximation numbers (14.2).

Such linear schemes of approximation seem very natural when the orthobasis is either the classical Fourier basis or else a classical orthogonal polynomial basis, and we think in terms of the "first n terms" as the "n lowest frequencies." Indeed, several results on harmonic analysis of smooth functions promote the expectation that the coefficients decay with increasing frequency index. It therefore seems natural in such a setting to adopt once and for all the standard ordering, to expect that the coefficients decay with increasing index, or in other words, that the important terms in the expansion occur at the early indices in the sequence. There are many examples in both practical and theoretical approximation theory showing the essential validity of this type of linear approximation.

In general, orthobases besides the classical Fourier/orthogonal polynomial sets (and close relatives) cannot be expected to display such a fixed one-dimensional ordering of coefficient amplitudes. For example, the wavelet basis has two indices, one for scale and one for position, and it can easily happen that the "important terms" in an expansion cluster at coefficients corresponding to the same position but at many different scales. This happens, for example, when we consider the coefficients of a function with punctuated smoothness, i.e., a function which is piecewise-smooth away from discontinuities. Since, in general, the position of such singularities varies from one function to the next, in such settings one cannot expect the "n most important coefficients" to occur at indices chosen in a fixed nonadaptive fashion. It makes more sense to consider approximation schemes using n-term approximations where the included terms are the biggest n terms rather than the first n terms in a fixed ordering; compare Fig. 5. Equivalently, we consider the ordering defined by coefficient amplitudes

$$ |\theta|_{(1)} \ge |\theta|_{(2)} \ge |\theta|_{(3)} \ge \cdots \eqno(14.1) $$

and define the approximation operator $Q_n$ which retains the terms with the n biggest coefficient amplitudes. This operator at first glance seems linear, because the retained functionals derive from inner products of the function with basis elements; however, in fact it is nonlinear, because which functionals are retained depends on the function being approximated.

This nonlinear approximation operator conforms in the best possible fashion with the idea that the n most important coefficients should appear in the approximation, while the less important coefficients should not. It is also quantitatively better than any fixed linear approximation operator built from the basis, because the square of its approximation error is the energy of the omitted coefficients, which by the rearrangement relation (14.1) is not larger than the corresponding sum for any fixed nonadaptive choice of n terms.
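To make the distinction concrete, here is a small numerical sketch (our own illustration, not from the paper) comparing first-n and best-n approximation in an orthonormal basis; by orthogonality, both errors are computed directly from the coefficient sequence:

    import numpy as np

    def first_n_error(coeffs, n):
        # Linear scheme: keep the first n coefficients in the fixed ordering;
        # by orthogonality, the error is the energy of the tail.
        return np.sqrt(np.sum(coeffs[n:] ** 2))

    def best_n_error(coeffs, n):
        # Nonlinear scheme: keep the n biggest-magnitude coefficients,
        # wherever they occur; error is the energy of the rest.
        mags = np.sort(np.abs(coeffs))[::-1]   # decreasing rearrangement (14.1)
        return np.sqrt(np.sum(mags[n:] ** 2))

    # Example: a sparse coefficient sequence, as for a piecewise-smooth object
    rng = np.random.default_rng(0)
    theta = np.zeros(1024)
    theta[rng.choice(1024, size=20, replace=False)] = rng.normal(size=20)
    print(first_n_error(theta, 50), best_n_error(theta, 50))

On sparse sequences of this kind—typical of wavelet expansions of objects with punctuated smoothness—the best-n error is dramatically smaller, while on monotonically decaying coefficient sequences the two schemes essentially coincide.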


B. Best Orthobasis for n-Term Approximation

In general, it is not possible to say much interesting about the error in n-term approximation for a fixed function. We therefore consider the approximation for a function class which is a ball of a normed (or quasinormed) linear space of functions. Given such a class and an orthogonal basis, the error of best n-term approximation is defined by

(14.2)

We call the resulting sequence the Stechkin numbers of the class in the basis, honoring the work of the Russian mathematician S. B. Stechkin, who worked with a similar quantity in the Fourier basis. The Stechkin numbers give a measure of the quality of representation of the class from the basis. If they are small, this means that every element of the class is well-approximated by n terms chosen from the basis, with the possibility that different sets of n terms are needed for different elements.

Different orthonormal bases can have very different approximation characteristics for a given function class. Suppose we consider the class BV of all functions on the interval with bounded variation, as in [31]. For the Fourier system, the Stechkin numbers of BV decay slowly, owing to the fact that BV contains objects with discontinuities, and sinusoids have difficulty representing such discontinuities; for the Haar-wavelet system they decay markedly faster, intuitively because the operator in the Haar expansion can easily adapt to the presence of discontinuities by including terms in the expansion at fine scales in the vicinity of such singularities and using only terms at coarse scales far away from singularities.

Consider now the problem of finding a best orthonormal basis—i.e., of minimizing the Stechkin numbers over all orthobases. We call the resulting quantity the nonlinear orthowidth of the class; this is not one of the usual definitions of nonlinear widths (see [77]) but is well suited for our purposes. This width seeks a basis in some sense ideally adapted to the class, getting the best guarantee of performance in approximation of members of the class by the use of adaptively chosen terms. As it turns out, it seems out of reach to solve this problem exactly in many interesting cases; and any solution might be very complex, for example, depending delicately on the choice of n. However, it is possible, for balls arising from classes which have an unconditional orthobasis and sufficient additional structure,

to obtain asymptotic results. Thus for function balls defined by quadratic constraints on the function and its m-th derivative, the Fourier basis and a smooth periodic wavelet basis both attain the nonlinear orthowidth up to constant factors, saying that either the standard Fourier basis, or else a smooth periodic wavelet basis, is a kind of best orthobasis for such a class. For the function balls BV, consisting of functions on the interval with bounded variation, the Haar basis attains the nonlinear orthowidth up to constant factors, so that the Haar basis is a kind of best orthobasis for such a class. For comparison, a smooth wavelet basis is also a kind of best basis for BV, while a Fourier basis is not. A summary is given in the table at the bottom of this page.

Results on rates of nonlinear approximation by wavelet series for an interesting range of Besov classes were obtained in [29].

C. Geometry of Best Orthobasis

There is a nice geometric picture which underlies these results about best orthobases; essentially, an orthobasis unconditional for a function class is asymptotically a best orthobasis for that class. Thus the extensive work of harmonic analysts to build unconditional bases—a demanding enterprise whose significance was largely closed to outsiders for many years—can be seen as an effort which, when successful, has as a byproduct the construction of best orthobases for nonlinear approximation.

To see this essential point requires one extra element. Let $0 < p \le 2$, and note especially that we include the possibility that $p < 1$. Let $|\theta|_{(n)}$ denote the n-th element in the decreasing rearrangement of the magnitudes of the entries in the coefficient sequence $\theta$, so that $|\theta|_{(1)} \ge |\theta|_{(2)} \ge \cdots$. The weak-$\ell^p$ norm is

$$ \|\theta\|_{w\ell^p} = \sup_{n}\, n^{1/p}\, |\theta|_{(n)}. $$

This is really only a quasinorm; it does not in general obey the triangle inequality. This norm is of interest because of the way it expresses the rate of approximation of the operators $Q_n$ in a given basis. Indeed, if $\|\theta\|_{w\ell^p} \le C$, then the best n-term approximation error is at most a constant multiple of $n^{1/2 - 1/p}$.

Name                       Best Basis

$L^2$-Sobolev              Fourier or Wavelet
$L^p$-Sobolev              Wavelet
Hölder                     Wavelet
Bump Algebra               Wavelet
Bounded Variation (BV)     Haar
Segal Algebra              Wilson


The result has a converse. A given function can be approximated by $Q_n$ with error $O(n^{1/2 - 1/p})$ for all n if its coefficient sequence belongs to weak-$\ell^p$; it can only be approximated at that rate if the coefficient sequence belongs to (essentially) weak-$\ell^p$. A given function ball has functions all of which can be approximated by $Q_n$ at error $\le C'\,n^{1/2 - 1/p}$ for all n, with $C'$ fixed independently of the function, if and only if, for some constant, the weak-$\ell^p$ norms of the coefficient sequences of its members are uniformly bounded.

Suppose we have a ball arising from a class with an unconditional basis, and we consider using a possibly different orthobasis to do the nonlinear approximation. The coefficient sequence in the new basis obeys $\theta' = U\theta$, where $U$ is an orthogonal transformation of sequence space. Define the image, under $U$, of the original body of coefficient sequences; the rate of convergence of nonlinear approximation with respect to the new system is then constrained by the weak-$\ell^p$ behavior of this image.

By the unconditional basis property, there is an orthosymmetric set and constants such that the body of coefficient sequences is sandwiched between two homothetic multiples of that set. Hence, up to homothetic multiples, the body is orthosymmetric.

Now the fact that the body is orthosymmetric means that it is in some sense "optimally positioned" about its axes; hence it is intuitive that rotating it by $U$ cannot improve its position. Thus [31]

We conclude that if we have an orthogonal unconditional basis for the class, and if its Stechkin numbers decay at a given rate, then it is impossible that any other orthobasis improve on that rate by more than a constant factor. Indeed, if it were so, then we would have some basis achieving a better rate, which would imply that the unconditional basis also achieves that better rate, contradicting the assumption that it merely achieves the given rate. In a sense, up to a constant-factor improvement, the axes of symmetry make a best orthogonal basis.

XV. ε-ENTROPY OF FUNCTIONAL CLASSES

We now show that nonlinear approximation in an orthogonal unconditional basis of a class can be used, quite generally, to obtain asymptotically, as ε → 0, the optimal degree of data compression for a corresponding ball. This completes the development of the last few sections.

A. State of the Art

Consider the problem of determining the Kolmogorov ε-entropy for a function ball; this gives the minimal number of bits needed to describe an arbitrary member of the ball to within accuracy ε.

This is a hard problem, with very few sharp results. Underscoring the difficulty of obtaining results in this area is the commentary of V. M. Tikhomirov in Kolmogorov's Selected Works [94]:

The question of finding the exact value of the ε-entropy is a very difficult one … Besides [one specific example] the author of this commentary knows no meaningful examples of infinite-dimensional compact sets for which the problem of ε-entropy is solved exactly.

This summarizes the state of work in 1992, more than 35 years after the initial concept was coined by Kolmogorov. In fact, outside of the case alluded to by Tikhomirov, of the ε-entropy of Lipschitz functions measured in the uniform metric, and the case which has emerged since then, of smooth functions in $L^2$ norm as mentioned above, there are no asymptotically exact results. The typical state of the art for research in this area is that within usual scales of function classes having finitely many derivatives, one gets order bounds: finite positive constants $C_0$ and $C_1$, and an exponent $\alpha$ depending on the class but not on ε, such that

$$ C_0\, \varepsilon^{-1/\alpha} \;\le\; H_\varepsilon \;\le\; C_1\, \varepsilon^{-1/\alpha} \eqno(15.1) $$

where $H_\varepsilon$ denotes the ε-entropy of the ball.

Such order bounds display some paradigmatic features of the state of the art of ε-entropy research. First, such a result does tie down the precise rate involved in the growth of the net (i.e., the exponent as ε → 0). Second, it does not tie down the precise constants involved in the decay (i.e., $C_0$ and $C_1$ differ). Third, the result (and its proof) does not directly exhibit information about the properties of an optimal ε-net. For a review of the theory see [66].

In this section we consider only the rough asymptotics of the ε-entropy via the critical exponent: the infimal exponent at which a bound of the type (15.1) can hold as ε → 0. If the two-sided bound (15.1) holds, then the critical exponent is determined by it; but a critical exponent can exist even when precise order bounds do not. We should think of the critical exponent as capturing only crude aspects of the asymptotics of ε-entropy, since it ignores "log terms" and related phenomena. We will show how to code in a way which achieves this rough asymptotic behavior.

B. Achieving the Exponent of ε-Entropy

Suppose we have a function ball arising from a class with unconditional basis, and that, in addition, the class has a certain tail compactness for the norm. In particular, suppose that, in some fixed arrangement of coordinates, the error of linear approximation from the first N coordinates decays like a negative power of N:

(15.2)

for some positive exponent.


We use nonlinear approximation in the basis to construct a binary coder giving an ε-description for all members of the ball. The idea: for an appropriate squared-distortion level, there is an n so that the best n-term approximation achieves the needed approximation error. We then approximately represent the function by digitally encoding the approximant. A reasonable way to do this is to concatenate a lossless binary code for the positions of the n important coefficients with a lossy binary code for quantized values of those coefficients, taking care that the coefficients are encoded with an accuracy enabling accurate reconstruction of the approximant

(15.3)

Such a two-part code will produce a decodable binary representation having the required distortion for every element of the ball.

Here is a very crude procedure for choosing the quantization of the coefficients: there are n coefficients to be encoded; the coefficients can all be encoded with accuracy q using a total of $n \log(1/q)$ bits. If we choose q so that the accumulated quantization error $n q^2$ is small enough relative to the distortion budget, then we achieve (15.3).

To encode the positional information giving the locations of the coefficients used in the approximant, use (15.2). This implies that, for a given ε, we can ignore coefficients outside an initial range of coordinates whose length grows only polynomially in $1/\varepsilon$. Hence the positions can be encoded using at most order $n \log(1/\varepsilon)$ bits.

This gives a total codelength of order $n \log(1/\varepsilon)$ bits. Now if the Stechkin numbers decay at the critical rate, the required n grows like the corresponding power of $1/\varepsilon$. Hence, assuming only that the class has an unconditional basis and that it is minimally tail-compact, we conclude that

(15.4)

and that a nonlinear approximation-based coder achieves thisupper bound.
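To make the two-part idea concrete, here is a toy sketch (our own illustration, not the paper's construction; beta and C are assumed tail-compactness parameters standing in for the constants in (15.2)):

    import numpy as np

    def two_part_code(theta, eps, beta=1.0, C=1.0):
        # Bandlimit from tail compactness (15.2): coefficients beyond N are
        # negligible at distortion level eps.
        N = min(len(theta), int(np.ceil((2.0 * C / eps) ** (1.0 / beta))))
        head = theta[:N]
        # Best n-term selection: keep biggest coefficients until the energy
        # of the discarded ones is below (eps/2)**2.
        order = np.argsort(-np.abs(head))
        resid = np.sum(head ** 2) - np.cumsum(head[order] ** 2)
        n = min(int(np.searchsorted(-resid, -(eps / 2.0) ** 2)) + 1, N)
        keep = np.sort(order[:n])
        # Scalar quantization with step q chosen so that n * q**2 <= (eps/2)**2.
        q = eps / (2.0 * np.sqrt(n))
        levels = np.round(head[keep] / q).astype(int)
        # Codelength estimate: positions (n * log2 N) plus quantized values.
        bits = n * np.log2(max(N, 2)) + np.sum(np.ceil(np.log2(np.abs(levels) + 2)))
        return keep, levels * q, bits

The two parts are exactly those of the text: a positional code for `keep` and a lossy code for `levels`; the $n \log_2 N$ positional cost is the term the tree-based coders of Section XVII improve upon.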

C. Lower Bound via Rate-Distortion Theory

The upper bound realized through nonlinear approximation is sharp, as shown by the following argument [32]. Assume that the ball lies in a function class that admits the given basis as an unconditional basis. The unconditional basis property means, roughly speaking, that the ball contains many very-high-dimensional hypercubes of appreciable sidelength. As we will see, rate-distortion theory tells us precisely how many bits are required to ε-faithfully represent objects in such hypercubes chosen on average. Obviously, one cannot faithfully represent every object in such hypercubes with fewer than the indicated number of bits.

Suppose now that (15.2) holds. We shall assume also that the exponent there is the best possible; more precisely, we assume that, for some constant,

(15.5)

For fixed n, take an element of the ball attaining the nonlinear n-width. Let δ denote the sidelength determined by its coefficients. By the unconditionality of the norm, for every sequence of signs, the object obtained by flipping the signs of those coefficients accordingly still belongs to the ball. Hence we have constructed an orthogonal hypercube of dimension m and sidelength δ lying inside the ball.

Consider now a random vertex of the hypercube. Representing this vertex with error ε is, owing to the orthogonality of the hypercube, the same as representing the sequence of signs with a corresponding per-coordinate error. Now use rate-distortion theory to obtain the rate-distortion curve for a binary discrete memoryless source with mean-squared error as single-letter difference distortion measure. No ε-faithful code for this source can use essentially fewer than $m\,R(D)$ bits, where $R(D)$ is that rate-distortion function. Hence, with the parameters so matched, we have

(15.6)
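For orientation, the binary rate-distortion fact being invoked can plausibly be written as follows (a standard computation stated under our own normalization, not a display recovered from the paper): for an equiprobable source with letters $\pm\delta$ and squared-error distortion $D$,

$$ R(D) = 1 - h_2\!\Big(\frac{D}{4\delta^2}\Big), \qquad 0 \le D \le 2\delta^2, \qquad h_2(p) = -p\log_2 p - (1-p)\log_2(1-p), $$

since a sign error costs $(2\delta)^2 = 4\delta^2$ and occurs with the crossover probability of the test channel.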

Since, for sufficiently large n, the hypercube sidelength cannot be essentially smaller than the assumed rate (15.5) allows, we find that the codelength of any ε-faithful code must grow at least at the critical rate; in other words, the exponent in this lower bound matches the exponent of the upper bound (15.4). We then conclude from (15.6) that the critical exponent of the ε-entropy is attained, and the nonlinear-approximation coder of the previous subsection is optimal at the level of rough asymptotics.

D. Order Asymptotics of ε-Entropy

The notion of achieving the exponent of the ε-entropy, while leading to very general results, is less precise than we would like. We now know that for a wide variety of classes of smooth functions, not only can the exponent be achieved, but also the correct order asymptotics (15.1) can be obtained by coders based on scalar quantization of the coefficients in a wavelet basis, followed by appropriate positional coding [14], [7], [15].

An important ingredient of nonlinear approximation results is the ability of the selected n terms to occur at variable positions in the expansion. However, provision for the selected terms to occur in completely arbitrary arrangements of time-scale positions is actually not necessary, and we can obtain bit savings by exploiting the more limited range of possible positional combinations. A second factor is that most of the coefficients occurring in the quantized approximant involve small integer multiples of the quantum, and it can be known in advance that this is typically so at finer scales; provision for the "big coefficients" to occur in completely arbitrary orders among the significant ones is also not necessary. Consequently, there are further bit savings available there as well. By exploiting these two facts, one can develop coders which are within constant factors of the ε-entropy. We will explain this further in Section XVII below.

XVI. COMPARISON TO TRADITIONAL TRANSFORM CODING

The results of the last few sections offer an interesting comparison with traditional ideas of transform coding theory, for example, the theory of Section II.


(a) (b)

Fig. 6. A comparison of "diagonalization" problems. (a) Karhunen–Loève finds axes of symmetry of concentration ellipsoid. (b) Unconditional basis finds axes of symmetry of function class.

A. Nonlinear Approximation as Transform Coding

Effectively, the above results on nonlinear approximation can be interpreted as describing a very simple transform coding scheme. Accordingly, the results show that transform coding in an unconditional basis for a class provides a near-optimal coding scheme for the corresponding function balls.

Consider the following very simple transform coding idea. It has the following steps.

• We assume that we are given a certain quantization step q and a certain bandlimit N.

• Given an object to be compressed, we obtain its first N coefficients according to the system.

• We then apply simple scalar quantization to those coefficients, with stepsize q.

• Anticipating that the vast majority of these quantized coefficients will turn out to be zero, we apply run-length coding to the quantized coefficients to efficiently code long runs of zeros.

This is a very simple coding procedure, perhaps even naive. But it turns out that

• if we apply this procedure to functions in the ball;

• if this ball arises from a function class for which the system is an unconditional basis;

• if this ball obeys the minimal tail compactness condition (15.2), and the Stechkin numbers obey (15.5);

• if the coding parameters are appropriately calibrated;

then the coder achieves ε-descriptions of every element, attaining the rough asymptotics for the ε-entropy as described by (15.4).

Essentially, the idea is that the behavior of the "nonlinear approximation" coder of Section XV can be realized by the "scalar quantization" coder, when calibrated with the right parameters. Indeed, it is obvious that the parameter q of the scalar quantizer and the quantization accuracy of the nonlinear approximation coder play the same role; to calibrate the two coders they should simply be made equal. It then seems natural to calibrate the bandlimit N with ε via the tail-compactness relation (15.2). After this calibration the two coders are roughly equivalent: they select approximants with precisely the same nonzero coefficients. The nonzero coefficients might be quantized slightly differently, but using a few simple estimates deriving from the calibration, one can see that they give comparable performance both in bits used and distortion achieved.

B. Comparison of Diagonalization Problems

Now we are in a position to compare the theory of Sections XI–XV with the Shannon story of Section II. The story of Section II says: to optimally compress a Gaussian stochastic process, transform the data into the Karhunen–Loève domain, and then quantize. The ε-entropy story of Sections XI–XV says: to (roughly) optimally compress a function ball, transform into the unconditional basis, and quantize.

In short, an orthogonal unconditional basis of a normed space plays the same role for function classes as the eigenbasis of the covariance plays for stochastic processes.

We have pointed out that an unconditional basis furnishes a generalization, to nonquadratic forms, of the concept of diagonalization of a quadratic form. The Hölder class has, after an equivalent renorming, the shape of a hyperrectangle. The Bump Algebra has, after an equivalent renorming, the shape of an octahedron. A wavelet basis serves as axes of symmetry of these balls, just as an eigenbasis of a quadratic form serves as axes of symmetry of the corresponding ellipsoid.

For Gaussian random vectors, there is the concept of "ellipsoid of concentration"; this is an ellipsoidal solid which contains the bulk of the realizations of the Gaussian distribution. In effect, the Gaussian rate-distortion coding procedure identifies the problem of coding with one of transforming into the basis serving as axes of symmetry of the concentration ellipsoid. In comparison, our interpretation of the Kolmogorov ε-entropy theory is that one should transform into the basis serving as axes of symmetry of the function ball, as illustrated by Fig. 6.


Class                    Process          Basis

Ellipsoid                Gaussian         Eigenbasis
non-ellipsoidal Body     non-Gaussian     Unconditional Orthobasis

In effect, the function balls that we can describe by wavelet bases through norm equivalence are often nonellipsoidal, and in such cases do not correspond to the concentration ellipsoids of Gaussian phenomena. There is in some sense an analogy here to finding an appropriate orthogonal transform for non-Gaussian data. In the table at the top of this page, we record some aspects of this analogy without taking space to fully explain it. See also [34].

The article [27] explored using functional balls as models for real image data. The "Gaussian model" of data is an ellipsoid; but empirical work shows that, within the Besov and related scales, the nonellipsoidal cases provide a better fit to real image data. So the "better" diagonalization theory may well be the nontraditional one.

XVII. BEYOND ORTHOGONAL BASES: TREES

The effort of the harmonic analysis community to develop unconditional orthogonal bases can now be seen to have relevance to data compression and transform coding.

Unconditional bases exist only in very special settings, and the harmonic analysis community has developed many other interesting structures over the years—structures which go far beyond the concept of orthogonal basis.

We expect these broader notions of representation also to have significant data compression implications. In this section, we discuss representations based on the notion of dyadic tree and some data compression interpretations.

A. Recursive Dyadic Partitioning

A recursive dyadic partition (RDP) of a dyadic interval is any partition reachable by applying two rules.

• Starting Rule: The interval itself is an RDP.

• Dyadic Refinement: If P is an RDP, and we replace some interval in P by the partition of that dyadic interval into its left and right dyadic subintervals, then the result is a new RDP.

RDP's are also naturally associated to binary trees: if we label the root node of a binary tree by the original interval and let the two children of a node correspond to the two dyadic subintervals of that node's interval, we associate with each RDP a tree whose terminal nodes are the subintervals comprising members of the partition. This correspondence allows us to speak of methods exploiting RDP's as "tree-structured methods."

Recursive dyadic partitioning has played an important role throughout the subject of harmonic analysis, as one can see from many examples in [89] and [45]. It is associated with ideas like the Whitney decomposition of the 1930's, and the Calderón–Zygmund Lemma of the 1950's.

A powerful strategy in harmonic analysis is to construct RDP's according to "stopping time rules" which describe when to stop refining. This gives rise to data structures that are highly adapted to the underlying objects driving the construction. One then obtains analytic information about the objects of interest by combining information about the structure of the constructed partition and the rule which generated it. In short, recursive dyadic partitioning is a flexible general strategy for certain kinds of delicate nonlinear analysis.

The RDP concept allows a useful decoration of the time-scale plane, based on the family of time-scale rectangles which we introduced in Section XII-B. If we think of all the intervals visited in the sequential construction of an RDP starting from the root as intervals where something "important is happening," and the ones not visited, i.e., those occurring at finer scales than intervals in the partition, as ones where "not much is happening," we thereby obtain an adaptive labeling of the time-scale plane by "importance."

Here is a simple example, useful below. Suppose we construct an RDP using the rule "stop when no subinterval of the current interval has a wavelet coefficient larger than q;" the corresponding labeling of the time-scale plane shows a "stopping-time region" outside of which all intervals are unimportant, i.e., are associated to wavelet coefficients no larger than q. For later use, we call this stopping-time region the hereditary cover of the set of wavelet coefficients larger than q; it includes not only the cells associated to "big" wavelet coefficients, but also the ancestors of those cells. The typical appearance of such a region, in analyzing an object with discontinuities, is that of a region with very fine "tendrils" reaching down to fine scales in the vicinity of the discontinuities; the visual appearance can be quite striking; see Fig. 7.
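A small sketch of the hereditary-cover construction (our own illustration; coefficients are assumed indexed by level j and position k on a complete dyadic tree, with children (j+1, 2k) and (j+1, 2k+1)):

    def hereditary_cover(coeffs, q):
        # coeffs[j][k]: wavelet coefficient at level j, position k.
        # Returns the set of (j, k) cells in the hereditary cover:
        # every cell with a q-big coefficient, plus all its ancestors.
        cover = set()
        for j in range(len(coeffs)):
            for k, c in enumerate(coeffs[j]):
                if abs(c) > q:
                    jj, kk = j, k
                    while jj >= 0 and (jj, kk) not in cover:
                        cover.add((jj, kk))      # walk up toward the root
                        jj, kk = jj - 1, kk // 2
        return cover

The early exit when an ancestor is already in the cover keeps the total work proportional to the size of the cover itself.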

Often, as in the Whitney and Calderón–Zygmund constructions, one runs the "stopping time" construction only once, but there are occasions where running it repeatedly is important. Doing so will produce a sequence of nested sets; in the hereditary-cover example, one can see that running the stopping-time argument for a decreasing sequence of thresholds will give a sequence of regions; outside of the k-th one, no coefficient can be larger than the k-th threshold.

In the 1950's and early 1960's, Carleson studied problems of interpolation in the upper half-plane. In this problem, we suppose we are given prescribed values at an irregular set of points in the upper half-plane, and we ask whether there is a bounded function on the line whose Poisson integral obeys the stated conditions. Owing to the connection with wavelet transforms, this is much like asking whether, from a given scattered collection of data


Fig. 7. An object, and the stopping-time region for the q-big coefficients.

about the wavelet transform, we can reconstruct the underlying function. Carleson developed complete answers to this problem, through a method now called the "Corona Construction," based on running stopping-time estimates repeatedly. Recently P. Jones [57] discussed a few far-reaching applications of this idea in various branches of analysis, reaching the conclusion that it "… is one of the most malleable tools available."

B. Trees and Data Compression

In the 1960's, Birman and Solomjak [8] showed how to use recursive dyadic partitions to develop coders achieving the correct order of Kolmogorov ε-entropy for Sobolev function classes.

The Birman–Solomjak procedure starts from a function of interest, and constructs a partition based on a parameter ε. Beginning with the whole domain of interest, the stopping rule refines a dyadic interval only if approximating the function on it by a polynomial of a fixed degree gives an error exceeding ε. Call the resulting RDP an ε-partition. The coding method is to construct an ε-partition, where the parameter ε is chosen to give a certain desired global accuracy of approximation, and to represent the underlying function by digitizing the approximating polynomials associated with the ε-partition. By analyzing the number of pieces of an ε-partition, and the total approximation error of an ε-partition, and by studying the required number of bits to code for the approximating polynomials on each individual piece, they showed that their method gave optimal order bounds on the number of bits required to ε-approximate a function in a ball.
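A toy sketch of the ε-partition construction (not the authors' code; piecewise-constant approximation, i.e., polynomial degree zero, is assumed for simplicity):

    import numpy as np

    def eps_partition(f, a, b, eps, depth=0, max_depth=16):
        # Recursively refine [a, b) until the best constant approximant of f
        # on each piece has (approximate) L2 error <= eps.
        t = np.linspace(a, b, 64)
        vals = f(t)
        err = np.sqrt(np.mean((vals - vals.mean()) ** 2) * (b - a))
        if err <= eps or depth >= max_depth:
            return [(a, b)]
        m = (a + b) / 2
        return (eps_partition(f, a, m, eps, depth + 1, max_depth) +
                eps_partition(f, m, b, eps, depth + 1, max_depth))

    # Example: a function with punctuated smoothness; the partition refines
    # aggressively near the jump and stays coarse elsewhere.
    parts = eps_partition(lambda t: np.sign(t - 0.3) + np.sin(2 * np.pi * t),
                          0.0, 1.0, eps=0.05)
    print(len(parts))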

This type of result works in a broader scale of settings than just the Sobolev balls. A modern understanding of the underlying nonlinear approximation properties of this approach is developed in [28].

This approximation scheme is highly adaptive, with the partition adapting spatially to the structure of the underlying object. The ability of the scheme to refine more aggressively in certain parts of the domain than in others is actually necessary to obtain optimal order bounds for certain choices of the indices defining the Sobolev norm; if we use an $L^{p'}$ norm for measuring approximation error, the adaptive refinement is necessary whenever $p'$ exceeds the exponent defining the Sobolev ball. Such spatially inhomogeneous refinement allows one to use fewer bits to code in areas of unremarkable behavior and more bits in areas of rapid change.

RDP ideas can also be used to develop wavelet-domain transform coders that achieve the optimal order ε-entropy bounds over Sobolev balls. Recall that the argument of Section XV showed that any coding of such a space must require a number of bits of the critical order, while a very crude transform coding was offered that required that order times an extra logarithmic factor. We are claiming that RDP ideas can be used to eliminate the log term in this estimate [14]. In effect, we are saying that by exploiting trees, we can make the cost of coding the positions of the big wavelet coefficients at worst comparable to the ε-entropy.

The key point about a function in a Sobolev space is that, on average, the coefficients decay with decreasing scale. Indeed, from the representation described in Section XII-D, we can see that the wavelet coefficients of


such a function obey, at each level j, a bound expressing geometric decay of the level-wise coefficient norm:

(17.1)
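The display (17.1) itself was lost in extraction; a reading consistent with the surrounding argument (our assumption, taking the ball in the Sobolev space of order m with norm measured in $L^2$, and $\theta_{j,k}$ the wavelet coefficients) would be

$$ \Big( \sum_{k} |\theta_{j,k}|^{2} \Big)^{1/2} \;\le\; C\, 2^{-jm}. $$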

In the relevant range of levels, this inequality rather powerfully controls the number and position of nonzero coefficients in a scalar quantization of the wavelet coefficients. Using it in conjunction with RDP ideas allows one to code the positions of the nonzero wavelet coefficients with far fewer bits than we employed in the run-length coding ideas of Section XV.

After running a scalar quantizer on the wavelet coefficients, we are left with a scattered collection of nonzero quantizer outputs; these correspond to coefficients exceeding the quantizer interval. Now form the hereditary cover of the set of all coefficients exceeding the quantum, using the stopping-time argument described in Section XVII-A. The coder will consist of two pieces: a lossless code describing this stopping-time region (also the positions within the region containing nonzero quantizer outputs), and a linear array giving the quantizer outputs associated with nonzero cells in the region.

This coder allows the exact same reconstruction that was employed using the much cruder coder of Section XV. We now turn to estimates of the number of bits required.

A stopping-time region can be easily coded by recording the bits of the individual refine/don't-refine decisions; we should also record the presence/absence of a "big" coefficient at each nonterminal cell in the region, which involves a second array of bits.

The number of bits required to code for a stopping-time region is, of course, proportional to the number of cells in the region itself. We will use in a moment the inequality (17.1) to argue that, in a certain maximal sense, the number of cells in a hereditary cover is not essentially larger than the original number of entries generating the cover. It will follow from this that the number of bits required to code the positions of the nonzero quantizer outputs is, in a maximal sense, proportional to the number of nonzero quantizer outputs. This, in turn, is the desired estimate; the cruder approach of Section XV gave only an estimate larger by a logarithmic factor, where the count is the number of quantizer outputs. For a careful spelling-out of the kind of argument we are about to give, see [33, Lemma 10.3].

We now estimate the maximal number of cells in the stopping-time region, making some remarks about the kind of control exerted by (17.1) on the arrangement and sizes of the wavelet coefficients; compare Fig. 7. The wavelet coefficients of functions in a Sobolev ball have, for a threshold q, a "Last Full Level" $j_1$: this is the finest level for which one can find some element of the ball such that, for every dyadic interval at that level, the associated wavelet coefficient exceeds q in absolute value. By (17.1), any such level obeys a bound determined by the real solution of the equation balancing the number of intervals at the level against the decay in (17.1).

The wavelet coefficients also have a "First Empty Level" $j_0$, i.e., a coarsest level beyond which all wavelet coefficients are bounded above in absolute value by q. By (17.1), any nonzero quantizer output occurs at a level no finer than $j_0$, where $j_0$ is again determined by balancing q against the decay in (17.1).

At scales intermediate between the two special values $j_1$ and $j_0$, there are possibly "sparse levels," which can contain wavelet coefficients exceeding q in a subset of the positions. However, the maximum possible number of such nonzero quantizer outputs at a level thins out rapidly, in fact exponentially, with increasing level: the level sums controlled by (17.1) force such a bound. In short, there are about $2^{j_1}$ possible "big" coefficients at the level nearest $j_1$, but the maximal number of nonzero coefficients decays exponentially fast at scales away from $j_1$. In an obvious probabilistic interpretation of these facts, the "expected distance" of a nonzero coefficient below $j_1$ is bounded by a constant.
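One consistent way to fill in the missing counting displays (our assumption, continuing from the reading of (17.1) given above): if $N_j$ denotes the number of coefficients exceeding q at level j, then

$$ N_j\, q^{2} \;\le\; \sum_{k} |\theta_{j,k}|^{2} \;\le\; C^{2}\, 2^{-2jm} \quad\Longrightarrow\quad N_j \;\le\; (C/q)^{2}\, 2^{-2jm}, $$

so $N_j$ decays exponentially in j, and the last full level $j_1$ (where $N_{j_1} = 2^{j_1}$) would satisfy $2^{j_1(2m+1)} = (C/q)^{2}$.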

Now obviously, in forming the hereditary cover of the positions of nonzero quantizer outputs, we will not obtain a set larger than the set we would get by including all positions through level $j_1$, and also including all "tendrils" reaching down to positions of the nonzeros at finer scales than this. The expected number of cells in a tendril is bounded, and the number of tendrils is of order $2^{j_1}$. Therefore, the maximal number of cells in the q-big region is not more than a constant multiple of $2^{j_1}$. The maximal number of nonzeros is of the same order.

These bounds imply the required estimates on the number of bits to code for positional information.

One needs to spend a little care also on the encoding of the coefficients themselves. A naive procedure would spend the same number of bits on each of the retained coefficients, leading to a crude estimate with an extra logarithmic factor for encoding the coefficient information. By making use of the nested sets in the hereditary cover described earlier, one can structure the coefficients to be retained into layers, in which the label of the layer indicates how many (or few) bits need to be spent on coefficients in that layer. One can estimate, similarly to what was done above, that the layers with large coefficients, for which more bits are required, contain few elements; layers corresponding to much smaller coefficients have many more elements, but because their label has already restricted their size, we spend fewer bits on them to specify them with the same precision. Accounting for the cost in bits of this procedure, one finds that encoding the coefficients also requires only a number of bits of the desired order.

This encoding strategy can be modified so as to be progressive and universal. By adding one bit for each existing coefficient and new bits to specify the new coefficients and their positions, it becomes progressive. Each additional stream of bits serves to add new detail to the existing approximation in a nonredundant way. The encoding is also universal in the sense that the encoder does not need to know the characteristics of the class: the encoder is defined once and for all and enjoys the property that each class which has the wavelet basis as


an unconditional basis will be optimally encoded with respect to Kolmogorov entropy. It is interesting to note that practical wavelet-based encoders, such as those introduced for images in [86] and [82], carry out a similar procedure.

XVIII. BEYOND TIME-SCALE: TIME-FREQUENCY

The analysis methods described so far—wavelets or trees—were time-scale methods. These methods associate, to a simple function of time only, an object extending over both time and scale, and they extract analytic information from that two-variable object.

An important complement to this is the family of time-frequency methods. Broadly speaking, these methods try to identify the different frequencies present in an object at each time.

In a sense, time-scale methods are useful for compressing objects which display punctuated smoothness—e.g., which are typically smooth away from singularities. Time-frequency methods work for oscillatory objects where the frequency content changes gradually in time.

One thinks of time-scale methods as more naturally adapted to image data, while time-frequency methods seem more suited for acoustic phenomena. For some viewpoints on mathematical time-frequency analysis, see [21], [23], and [39].

In this section we very briefly cover time-frequency ideas, to illustrate some of the potential of harmonic analysis in this setting.

A. Modulation Spaces

Let $g$ be a Gaussian window. Let $g_{t,\omega}$ denote the Gabor function $g_{t,\omega}(u) = g(u - t)\, e^{i\omega u}$, localized near time $t$ and frequency $\omega$. The family of all Gabor functions provides a range of oscillatory behavior associated with a range of different intervals in time. These could either be called "time-frequency atoms" or, more provocatively, "musical notes."

Let the class in question denote the collection of all functions which can be decomposed as a superposition of Gabor functions with absolutely summable coefficients; let the norm of a function denote the smallest value of the coefficient sum occurring in any such decomposition. This provides a normed linear space of functions, which Feichtinger calls the Segal algebra [40]. For a given amplitude, the norm constraint controls the number of "musical notes" of that strength which a function can contain. So such a norm controls in a way the complexity of the harmonic structure of the function.

A special feature of the above class is that there is no unique way to obtain a decomposition of the desired type, and no procedure is specified for doing so. As with our earlier analyses, it would seem natural to seek a basis for this class; one could even ask for an unconditional basis.

This Segal algebra is an instance of Feichtinger's general family of modulation spaces, and we could more generally ask for unconditional bases for any of these spaces. The modulation spaces offer an interesting scale of spaces in certain respects analogous to the $L^p$-Sobolev and related Besov and Triebel scales; in this brief survey, we are unable to describe them in detail.

B. Continuous Gabor Transform

Even older than the wavelet transform is the continuous Gabor transform (CGT). The Gabor transform can be written in a form analogous to (12.8) and (12.10):

$$ G_f(t, \omega) = \int f(u)\, \overline{g_{t,\omega}(u)}\, du \eqno(18.1) $$

$$ f(u) = c_g \iint G_f(t, \omega)\, g_{t,\omega}(u)\, dt\, d\omega \eqno(18.2) $$

with $c_g$ a normalizing constant depending only on the window.

Here $g$ can be the Gaussian introduced above; it can also be a "Gabor-like" function generated from a non-Gaussian window function.

If both the window and its Fourier transform are "concentrated" around the origin (which is true for the Gaussian), then (18.1) captures information in $f$ that is localized in time (around $t$) and frequency (around $\omega$); (18.2) writes $f$ as a superposition of all the different time-frequency pieces.
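A minimal numerical sketch of (18.1) on a sampled signal (our own illustration; a Gaussian window of width sigma is assumed):

    import numpy as np

    def cgt(x, dt, sigma, times, freqs):
        # Sampled continuous Gabor transform: inner products of the signal
        # against Gaussian-windowed complex exponentials g_{t,w}.
        u = np.arange(len(x)) * dt
        out = np.zeros((len(times), len(freqs)), dtype=complex)
        for i, t in enumerate(times):
            window = np.exp(-((u - t) ** 2) / (2 * sigma ** 2))
            for j, w in enumerate(freqs):
                atom = window * np.exp(1j * w * u)
                out[i, j] = np.sum(x * np.conj(atom)) * dt   # (18.1), discretized
        return out

    # Example: a chirp, whose frequency content changes gradually in time
    u = np.linspace(0, 1, 2048)
    x = np.cos(2 * np.pi * (50 * u + 40 * u ** 2))
    G = cgt(x, u[1] - u[0], sigma=0.05,
            times=np.linspace(0, 1, 32), freqs=2 * np.pi * np.arange(20, 140, 4))

The magnitude of G traces out the chirp's slowly rising instantaneous frequency, which is exactly the kind of object for which time-frequency methods are suited.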

Paternity of this transform is not uniquely assignable; it may be viewed as a special case in the theory of square-integrable group representations, a branch of abstract harmonic analysis [47]; it may be viewed as a special case of decomposition into canonical coherent states, a branch of mathematical physics [59]. In the signal analysis literature it is generally called the CGT to honor Gabor [44], and that appellation is very fitting on this occasion. Gabor proposed a "theory of communication" in 1946, two years before Shannon's work, that introduced the concept of logons—information-carrying "cells"—highly relevant to the present occasion, and to Shannon's sampling theory.

The information in the CGT is highly redundant. For instance, $G_f(t, \omega)$ captures what happens in $f$ not only at time $t$, but also near $t$, and similarly in frequency. The degree to which this "spreading" occurs is determined by the choice of the window function $g$: if $g$ is very narrowly concentrated around $0$, then the sensitivity of the transform to the behavior of $f$ is concentrated to the vicinity of $t$. The Heisenberg Uncertainty Principle links the concentration of $g$ with that of its Fourier transform: if we define the time spread $\sigma_t$ and the frequency spread $\sigma_\omega$ of the window in the usual second-moment fashion, then the product $\sigma_t \sigma_\omega$ is bounded below by an absolute constant, so that as $g$ becomes more concentrated, its transform becomes less so, implying that the frequency sensitivity of the CGT becomes more diffuse when we improve the time localization. Formalizing this, let $R(t, \omega)$ denote the rectangle of dimensions $\sigma_t \times \sigma_\omega$ centered at $(t, \omega)$. The choice of $g$ influences the shape of the region (more narrow in time, but elongated in frequency, if we choose $g$ very concentrated around $0$); the area of the cell is bounded below by the Uncertainty Principle. We think of each point of the transform as measuring properties of a "cell" in the time-frequency domain, indicating the region that is "captured" in $G_f(t, \omega)$.


For example, if two such cells are disjoint, we think of the corresponding values of the transform as measuring disjoint properties of $f$.

C. Atomic Decomposition

Together, (18.1) and (18.2) can be written as

(18.3)

another resolution of the identity operator. For parameters $\Delta t$ and $\Delta \omega$ to be chosen, define equispaced time points $t_k = k\,\Delta t$ and frequency points $\omega_l = l\,\Delta\omega$. Consider now a family of time-frequency rectangles built on these grid points.

Evidently, these rectangles are disjoint and tile the time-frequency plane.

Proceeding purely formally, we can partition the integral in the resolution of the identity to get a decomposition with individual operators, one associated with each rectangle in the family. This provides formally an atomic decomposition

(18.4)

In order to justify this approach, we would have to justify treating a rectangle as a single coherent region of the time-frequency plane. Heuristically, this coherence will apply if $\Delta t$ is smaller than the time spread of the window and if $\Delta \omega$ is smaller than the spread of its Fourier transform. In short, if the rectangle has the geometry of a Heisenberg cell, or smaller, then the above approach makes logical sense.

One application for atomic decomposition would be to characterize membership in the Segal Algebra. Feichtinger has proved that, with a sufficiently nice window, like the Gaussian, if we pick $\Delta t$ and $\Delta \omega$ sufficiently small, atomic decomposition allows one to measure the norm. Let a coefficient measure the size of the atom associated with each cell; then a function belongs to the class if and only if these sizes are summable. Moreover, the series then represents a decomposition of the function into elements of the class, with the sum of the norms of the pieces comparable to the norm of the whole.

So we have an equivalent norm for the Segal-algebra norm. This in some sense justifies the Gabor "logon" picture, as it shows that an object can really be represented in terms of elementary pieces, those pieces being associated with rectangular time-frequency cells, and each piece uniformly controlled.

D. Orthobasis

Naturally, one expects to obtain not just an atomic decomposition, but actually an orthobasis. In fact, Gabor believed that one could do so. One approach is to search for a window $g$, not necessarily the Gaussian, and a precise choice of $\Delta t$ and $\Delta \omega$, so that the samples of the CGT at equispaced points provide an exact norm equivalence.

While this is indeed possible, a famous result due to Balian and Low shows that it is not possible to achieve orthogonality using a window which is nicely concentrated in both time and frequency; to get an orthogonal basis from Gabor functions requires the window to have infinite time spread or infinite frequency spread. Hence the geometric picture of localized contributions from rectangular regions is not compatible with an orthogonal decomposition. A related effect is that the resulting Gabor orthogonal basis would not provide an unconditional basis for a wide range of modulation spaces. In fact, for certain modulation spaces, nonlinear approximation in such a basis would not behave optimally; e.g., we have examples of function balls in modulation spaces where the Stechkin numbers of a Gabor orthobasis decay at a suboptimal rate.

A way out was discovered with the construction of so-called Wilson orthobases for $L^2$ [22]. These are Gabor-like bases, built using a special smooth window of rapid decay, and consist of basis elements $\psi_{k,n}$, where $k$ is a position index and $n$ is a frequency index. $k$ runs through the integers, and $n$ runs through the nonnegative integers; the two parameters vary independently, except that $n = 0$ is allowed in conjunction only with even $k$. In detail:
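The display following "In detail" did not survive extraction; one standard form of a Wilson basis (our reconstruction under the conventions of [22], whose normalization may differ from the paper's) is

$$ \psi_{k,0}(t) = g(t - k), \qquad \psi_{k,n}(t) = \sqrt{2}\, g\Big(t - \frac{k}{2}\Big) \times \begin{cases} \cos(2\pi n t), & k + n \text{ even} \\ \sin(2\pi n t), & k + n \text{ odd} \end{cases} \quad (n \ge 1). $$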

Owing to the presence of the cosine and sine terms, as opposed to complex exponentials, these are not truly Gabor functions; but they can be viewed as superpositions of certain pairs of Gabor functions. The Gabor functions used in those pairs do not fill out the vertices of a single rectangular lattice, but instead they use a subset of the vertices of two distinct lattices. Hence the information in the Wilson coefficients derives indeed from sampling of the CGT with the special generating window, only the samples must be taken on two interleaved Cartesian grids; and the samples must be combined in pairs in order to create the Wilson coefficients, as shown in Fig. 8.

The resulting orthobasis has good analytic properties. Gröchenig and Walnut [51] have proved that it offers an unconditional basis for all the modulation spaces; in particular, it is an unconditional basis for the Segal algebra. As a result, Wilson bases are best orthobases for nonlinear approximation; for function balls arising from a wide range of modulation spaces,


Fig. 8. The sampling set associated with the Wilson basis. Sampling points linked by solid lines are combined by addition. Sampling points linked by dashed lines are combined by subtraction.

the Wilson orthobasis attains the optimal rate of n-term approximation. See also [50]. There are related data compression implications, although stating them requires some technicalities, for example, imposing on the objects to be compressed some additional decay conditions in both the time and frequency domains.

E. Adaptive Time-Frequency Decompositions

In a sense, CGT analysis, Gabor atomic decomposition, and Wilson bases merely scratch the surface of what one hopes to achieve in the time-frequency domain. The essential limitation of these tools is that they are monoresolution. They derive from a uniform rectangular sampling of time and frequency, which is suitable for some phenomena; but it is easy to come up with models requiring highly nonuniform approaches.

Define the multiscale Gabor dictionary, consisting of Gabor functions with an additional scale parameter. Consider the class of objects having a multiscale decomposition

(18.5)

with an absolutely summable coefficient sequence. In a sense, this class is obtained by combining features of the Bump Algebra—multiscale decomposition—and the Segal Algebra—time-frequency decomposition. This innocent-looking combination of features responds to a simple urge for common-sense generalization, but it gets us immediately into mathematical hot water. Indeed, one cannot use a simple monoscale atomic decomposition based on the CGT to effectively decompose such an object into its pieces at different scales; one cannot from the monoscale analysis infer the size of the minimal coefficient sum appearing in such a decomposition. Moreover, we are unaware of any effective means of decomposing members of this class in a way which achieves a near-optimal representation, i.e., a representation (18.5) with near-minimal coefficient sum.
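The displays defining the dictionary elements and the decomposition (18.5) did not survive extraction; a natural reading (our assumption, with σ a scale parameter added to the Gabor atoms above) is

$$ g_{(t,\omega,\sigma)}(u) = \sigma^{-1/2}\, g\Big(\frac{u - t}{\sigma}\Big)\, e^{i\omega u}, \qquad f = \sum_{i} \alpha_i\, g_{(t_i,\omega_i,\sigma_i)}, \quad \sum_i |\alpha_i| < \infty, $$

with the class norm given by the smallest attainable coefficient sum, in analogy with the Segal algebra.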

One might imagine performing a kind of "multiscale Gabor transform"—with parameters time, scale, and frequency—and developing an atomic decomposition or even a basis based on a three-dimensional rectangular partitioning of this time-scale-frequency domain, but this cannot work without extra precautions [95]. The kind of sampling theorem and norm-equivalence result one might hope for has been proven impossible in that setting.

An important organizational tool in getting an understanding of the situation is to consider dyadic Heisenberg cells only, and to understand the decompositions of time and frequency which can be based upon them. These dyadic Heisenberg cells have dyadic sidelengths, reciprocal in time and frequency so that each cell has unit area, with lower left corner at a corresponding dyadic point of the time-frequency plane. The difficulty of the time-frequency-scale decomposition problem is expressed by the fact that each Heisenberg cell overlaps with infinitely many others, corresponding to different aspect ratios at the same location. This means that even in a natural discretization of the underlying parameter space, there is a wide range of multiscale Gabor functions interacting with each other at each point of the time-frequency domain.

This kind of dyadic structuring has been basic to the architecture of some key results in classical analysis, namely, Carleson's proof of the a.e. convergence of Fourier series, and also Fefferman's alternate proof of the Carleson theorem. Both of those proofs were based on dyadic decomposition


of time and frequency, and the driving idea in those proofs is to find ingenious ways to effectively combine estimates from all different Heisenberg cells despite the rather massive degree of overlap present in the family of all these cells. For example, Fefferman introduced a tree-ordering of dyadic Heisenberg cells, and was able to get effective estimates by constructing many partitions of the time-frequency domain according to the properties of the Fourier partial sums he was studying, obtaining a decomposition of the sums into operators associated with special time-frequency regions called trees, and combining estimates across trees into forests.

Inspired by their success, one might imagine that there is some way to untangle all the overlap in the dyadic Heisenberg cells and develop an effective multiscale time-frequency analysis. The goal would be to somehow organize the information coming from dyadic Heisenberg cells so as to iteratively extract from this time-scale-frequency space various "layers" of information. As we have put it, this program is, of course, rather vague.

More concrete is the idea of nonuniform tiling of the time-frequency plane. Instead of decomposing the plane by a sequence of disjoint congruent rectangles, one uses rectangles of various shapes and sizes, where the cells have been chosen specially to adapt to the underlying features of the object being considered.

The idea of adaptive tiling is quite natural in connection with the problem of multiscale time-frequency decomposition, though it originally arose in other areas of analysis. Fefferman's survey [41] gives examples of this idea in action in the study of partial differential equations, where an adaptive time-frequency tiling is used to decompose operator kernels for the purpose of estimating eigenvalues.

Here is how one might expect adaptive tiling to work in a multiscale time-frequency setting. Suppose that an object had a representation (18.5) where the fine-scale atoms occurred at particular time locations well separated from the coarse-scale atoms. Then one could imagine constructing a time-frequency tiling that had homologous structure: rectangles with finer time scales where the underlying atoms in the optimal decomposition needed to be fine-scale; rectangles with coarser time scales elsewhere. The idea requires more thought than it might at first seem, since the rectangles cannot be chosen freely; they must obey the Heisenberg constraint. The dyadic Heisenberg system provides a constrained set of building blocks which often can be fit together quite easily to create rather inhomogeneous tilings. Also, the rectangles must correspond rigorously to an analyzing tool: there must be an underlying time-frequency analysis with a width that is changing with spatial location. It would be especially nice if one could do this in a way providing true orthogonality.

In any event, it is clear that an appropriate multiscale time-frequency analysis in the setting of such classes cannot be constructed within a single orthobasis. At the very least, one would expect to consider large families of orthobases, and to select from such a family an individual basis best adapted for each individual object of interest. However, even there it is quite unlikely that one could obtain a true characterization of a space of this kind; i.e., it is unlikely that even within

the broader situation of adaptively chosen orthobases, the decompositions one could obtain would rival an optimal decomposition of the form (18.5), i.e., a decomposition with a minimal number of terms.

A corollary to this is that we really know of no effective method of data compression for this class: there is no effective transform coding for multiscale time-frequency classes.

XIX. COMPUTATIONAL HARMONIC ANALYSIS

The theoretical constructions of harmonic analysts correspond with practical signal-processing tools to a remarkable degree. The ideas which are so useful in the functional viewpoint, where one is analyzing functions of a continuous argument, correspond to completely analogous tools for the analysis of discrete-time signals. Moreover, the tools can be realized as fast algorithms, so that on signals of length n they take order n or n log(n) operations to complete. Thus, corresponding to the theoretical developments traced above, we have today fast wavelet transforms, fast Gabor transforms, fast tree approximation algorithms, and even fast algorithms for adapted multiscale time-frequency analysis.

The correspondence between theoretical harmonic analysis and effective signal processing algorithms has its roots in two specific facts which imply a rather exact connection between the functional viewpoint of this paper and the digital viewpoint common in signal processing.

• Fast Fourier Transform: The finite Fourier transform provides an orthogonal transform for discrete-time sequences which, in a certain sense, matches perfectly with the classical Fourier series for functions on the circle. For example, on appropriate trigonometric polynomials, the first n Fourier series coefficients are (after normalization) precisely the n finite Fourier coefficients of the digital signal obtained by sampling the trigonometric polynomial; the first sketch after this list checks this numerically. This fact provides a powerful tool to connect concepts from the functional setting with the discrete-time signal setting. The availability of fast algorithms has made this correspondence a computationally effective matter. It is an eerie coincidence that the most popular form of the FFT algorithm prefers to operate on signals of dyadic length; for connecting theoretical harmonic analysis with the digital signal processing domain, the dyadic length is also the most natural.

• Sampling Theorem: The classical Shannon sampling theorem for bandlimited functions on the line has a perfect analog for digital signals obeying discrete-time bandlimiting relations. This is usually interpreted as saying that one can simply subsample a data series and extract the minimal nonredundant subset. A different way to put it is that there is an orthonormal basis for the bandlimited signals and that sampling provides a fast algorithm for obtaining the coefficients of a signal in that basis; the second sketch after this list illustrates the subsampling interpretation.
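As a numerical check of the first correspondence, here is a minimal sketch in Python; the degree K, the length n, and the random coefficients are arbitrary illustrative choices.

```python
import numpy as np

# A trigonometric polynomial of degree K, with random Fourier coefficients c_k,
# sampled at n > 2K equispaced points on the circle.
rng = np.random.default_rng(0)
K, n = 3, 16
c = rng.standard_normal(2 * K + 1) + 1j * rng.standard_normal(2 * K + 1)  # c_{-K}, ..., c_K
t = np.arange(n) / n
f = sum(c[k + K] * np.exp(2j * np.pi * k * t) for k in range(-K, K + 1))

# The normalized finite Fourier coefficients of the samples...
chat = np.fft.fft(f) / n

# ...agree exactly with the Fourier series coefficients (indices taken mod n).
assert np.allclose(chat[:K + 1], c[K:])   # k = 0, ..., K
assert np.allclose(chat[n - K:], c[:K])   # k = -K, ..., -1
```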
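And a minimal sketch of the digital sampling theorem, again with arbitrary choices of the length N and the band limit N/4: a real signal whose discrete Fourier transform vanishes outside the lowest quarter of frequencies is recovered exactly from its even-indexed samples.

```python
import numpy as np

# A real digital signal of length N, bandlimited to DFT frequencies |k| < N/4.
rng = np.random.default_rng(1)
N = 32
B = N // 4
X = np.zeros(N, complex)
X[1:B] = rng.standard_normal(B - 1) + 1j * rng.standard_normal(B - 1)
X[N - B + 1:] = X[1:B][::-1].conj()       # Hermitian symmetry => real signal
x = np.fft.ifft(X).real

# Subsampling extracts the minimal nonredundant subset...
y = x[::2]
Y = np.fft.fft(y)

# ...and, since the aliased copies of the spectrum do not overlap the band,
# the original is recovered from the subsamples alone.
Xr = np.zeros(N, complex)
Xr[:B] = 2 * Y[:B]                        # low positive frequencies
Xr[N - B + 1:] = 2 * Y[N // 2 - B + 1:]   # low negative frequencies
assert np.allclose(np.fft.ifft(Xr).real, x)
```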

There are, in addition, two particular tools for partitioning signal domains which allow effective digital implementation of breaking a signal into time-scale or time-frequency pieces.

• Smooth Orthonormal Localization: Suppose one takes a discrete-time signal of length n and breaks it into two


subsignals of length n/2, corresponding to the first half and the second half. Now subject those two halves to further processing, expanding each half into an orthonormal basis for digital signals of length n/2. In effect, one is expanding the original length-n signal into an orthonormal basis for length-n signals. The implicit basis functions may not be well-behaved near the midpoint. For example, if the length-n/2 basis is the finite Fourier basis, then the length-n basis functions will be discontinuous at the segmentation point. This discontinuity can be avoided by changing the original "splitting into two pieces" into a more sophisticated partitioning operator based on a kind of smooth orthonormal windowing. This involves treating the data near the segmentation point specially, taking pairs of values located at equal distances from the segmentation point and on opposite sides and, instead of simply putting one value in one segment and the other value in the other segment, putting special pairs of linear combinations of the two values in the two halves; see, for example, [71], [72], and [3], and the first sketch after this list.

• Subband Partitioning: Suppose we take a discrete-time signal, transform it into the frequency domain, and break the Fourier transform into two pieces, the high and low frequencies. Now transform the pieces back into the time domain. As the pieces are now bandlimited/bandpass, they can be subsampled, creating two new vectors consisting of the "high-frequency" and "low-frequency" parts. The two new vectors are related to the original signal by an orthogonal transformation, so the process is in some sense exact. Unfortunately, the brutal separation into high and low frequencies has many undesirable effects. One solution to this problem would have been to apply smooth orthonormal localization in the frequency domain. A different approach, with many advantages, is based on the time-domain method of conjugate quadrature filters.

In this approach, one applies a special pair of digital filters, high- and lowpass, to a digital signal of length n, and then subsamples each of the two results by a factor of two [88], [76], [97], [99]. The result is two signals of length n/2, so that the original cardinality of n is preserved, and, if the filters are very specially chosen, the transform can be made orthogonal. The key point is that the filters can be short. The most elementary example is to use a highpass filter with coefficients (1/√2, −1/√2) and a lowpass filter with coefficients (1/√2, 1/√2); see the second sketch after this list. The shortness of this filter means that the operator does not have the time-localization problems of the frequency-domain algorithm, but unfortunately this filter pair will not have very good frequency-domain selectivity. More sophisticated filter pairs, of greater length, have been developed; these are designed to maintain the orthogonality and to impose additional conditions which ensure both time- and frequency-domain localization.
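The following sketch shows the pairing idea of smooth orthonormal localization in its simplest form: each pair of samples straddling the segmentation point is replaced by an orthogonal (rotated) pair, so the overall map remains orthonormal. The genuine smooth-window constructions of [71] and [3] derive the rotation profile from a smooth window satisfying compatibility conditions; the taper below is merely illustrative.

```python
import numpy as np

def fold(x, angles):
    """Replace pairs straddling the midpoint by rotated pairs.

    Pair i couples x[m-1-i] (left of the midpoint m) with x[m+i] (right of it).
    Each 2x2 rotation is orthogonal and the pairs are disjoint, so the whole
    map is orthonormal.
    """
    y = np.array(x, dtype=float)
    m = len(x) // 2
    for i, a in enumerate(angles):
        c, s = np.cos(a), np.sin(a)
        l, r = y[m - 1 - i], y[m + i]
        y[m - 1 - i], y[m + i] = c * l - s * r, s * l + c * r
    return y

rng = np.random.default_rng(2)
x = rng.standard_normal(16)
# Illustrative taper: strong mixing right at the boundary, none farther away.
angles = np.pi / 4 * np.array([1.0, 0.5, 0.2, 0.0])
y = fold(x, angles)
assert np.isclose(np.sum(x**2), np.sum(y**2))   # energy preserved
```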
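And here is the elementary two-tap filter pair in action: one subband-partitioning step, verifying energy preservation (orthogonality) and perfect reconstruction.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(8)

# Filter by the two-tap pair and subsample by two; for these filters each
# retained output sample depends on a disjoint pair of input samples.
low  = (x[0::2] + x[1::2]) / np.sqrt(2)    # lowpass branch
high = (x[0::2] - x[1::2]) / np.sqrt(2)    # highpass branch

# The cardinality is preserved (n/2 + n/2) and the map is orthogonal...
assert np.isclose(np.sum(x**2), np.sum(low**2) + np.sum(high**2))

# ...and exactly invertible.
xr = np.empty_like(x)
xr[0::2] = (low + high) / np.sqrt(2)
xr[1::2] = (low - high) / np.sqrt(2)
assert np.allclose(xr, x)
```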

This set of tools can lead to fast algorithms for digital implementation of the central ideas in theoretical harmonic analysis.

A key point is that one can cascade the above operations. For example, if one can split a signal domain into two pieces, then one can split it into four pieces, by applying the same type of operation again to each piece. In this way, the dyadic structuring ideas that were so useful in harmonic analysis—dyadic partitioning of time-scale and time-frequency—correspond directly to dyadic structuring in the digital setting.

We give a few examples of this, stressing the role of combining elementary dyadic operations.

• Fast Meyer Wavelet Transform: The Meyer wavelet basis was originally defined by its frequency-domain properties, and so it is most natural to construct a digital variant using the Fourier domain. The forward transform algorithm goes as follows. Transform into the frequency domain. Apply smooth orthonormal windowing, breaking up the frequency domain into subbands consisting of pairs of intervals of dyadic widths, located symmetrically about zero frequency. Apply to each subband a Fourier analysis (actually either a sine or cosine transform) of length adapted to the length of the subband. The cost of applying this algorithm is dominated by the initial passage to the frequency domain, which is order n log(n). The inverse transform systematically reverses these operations.

The point of the fast algorithm is, of course, that one does not literally construct the basis functions, and one does not literally take the inner product of the digital signal with the basis function. This is all done implicitly. However, it is easy enough to use the algorithm to display the basis functions. When one does so, one sees that they are trigonometric polynomials which are periodic and effectively localized near the dyadic interval they should be associated with.

• Mallat Algorithm: Improving in several respects on the frequency-domain digital Meyer wavelet basis is a family of orthonormal wavelet bases based on time-domain filtering [69]. The central idea here is by now very well known: it involves taking the signal, applying subband partitioning with specially chosen digital highpass and lowpass filters, subsampling the two pieces by dyadic decimation, and then recursively applying the same procedure on the lowpass piece only. When the filters are appropriately specified, the result is an orthonormal transform on n-space.

The resulting transform on digital signals takes only order n arithmetic operations, so it has a speed advantage over the fast Meyer wavelet transform, which requires order n log(n) operations. Although we do not describe it here, there is an important modification of the filtering operators at the ends of the sequence which allows the wavelet basis functions to adapt to the boundary of the signal; i.e., we avoid the periodization of the fast Meyer transform [13]. Finally, the wavelet basis functions are compactly supported, with each basis element vanishing in the time domain outside an interval homothetic to its associated dyadic interval. This means that they have the correct structure to be


called a wavelet transform. The support property also means that only a small number of wavelet coefficients at a given scale are affected by singularities, or, to put it another way, the effect of a singularity is compressed into a small number of coefficients. A sketch of the cascade appears after this list.

• Fast Wilson Basis Transform: This is a digital version of the Wilson basis; the fast transform works as follows. First, one applies a smooth orthonormal partitioning to break up the time domain into equal-length segments. (It is most convenient if the segments have dyadic lengths.) Then one applies Fourier analysis, in the form of a sine or cosine transform, to each segment. The whole algorithm is order n log(m), where m is the length of a segment. The implicitly defined basis functions look like windowed sinusoids with the same arrangement of sine and cosine terms as in the continuum Wilson basis.
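As a minimal sketch of the Mallat cascade (with the elementary two-tap pair standing in for the longer, more frequency-selective filters used in practice): split, keep the highpass piece, and recurse on the lowpass piece only. Since the data halves at each level, the total cost is order n.

```python
import numpy as np

def dwt(x, levels):
    """Cascade the two-channel split on the lowpass piece only: O(n) total,
    since the level costs are n + n/2 + n/4 + ... < 2n."""
    coeffs, low = [], np.asarray(x, dtype=float)
    for _ in range(levels):
        coeffs.append((low[0::2] - low[1::2]) / np.sqrt(2))  # detail at this scale
        low = (low[0::2] + low[1::2]) / np.sqrt(2)
    coeffs.append(low)                                       # coarsest approximation
    return coeffs

def idwt(coeffs):
    """Invert the cascade, coarsest scale first."""
    low = coeffs[-1]
    for high in reversed(coeffs[:-1]):
        up = np.empty(2 * low.size)
        up[0::2] = (low + high) / np.sqrt(2)
        up[1::2] = (low - high) / np.sqrt(2)
        low = up
    return low

x = np.random.default_rng(4).standard_normal(32)
assert np.allclose(idwt(dwt(x, 3)), x)    # orthonormal, perfectly invertible
```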

We should now stress that, for the correspondence between a theoretical harmonic analysis concept and a computational harmonic analysis tool to hold, dyadic structuring operations should not be performed in a cavalier fashion. For example, if one is going to cascade subband partitioning operations many times, as in the Mallat algorithm, it is important that the underlying filters be rather specially chosen to be compatible with this repetitive cascade.

When this is done appropriately, one can arrive at digital implementations that are not vague analogies of the corresponding theoretical concepts, but can actually be viewed as "correct" digital realizations. As the reader expects by now, the mathematical expression that one has a "correct" realization is achieved by establishing a norm equivalence result. For example, if in the Mallat algorithm one cascades the subband partitioning operator using appropriate finite-length digital filters, one can construct a discrete wavelet transform for digital signals. This discrete transform is "correctly related" to a corresponding theoretical wavelet transform on the continuum because of a norm equivalence: if f is a function on the interval [0,1], and if one computes digital wavelet coefficients for the digitally sampled object (f(i/n)), then the appropriate Triebel and Besov norms of the digital wavelet coefficients behave quite precisely like the same norms of the corresponding initial segments of the theoretical wavelet coefficients of the continuum f. This is again a form of sampling theorem, showing that an equality of ℓ² norms (which follows simply from orthogonality) is accompanied by an equivalence in many other norms (which follows from much more delicate facts about the construction). A consequence, and of particular relevance for this paper, is that under simple assumptions one can use the digital wavelet transform coefficients for nonlinear approximation and expect the same bounds on approximation measures to apply; hence digital wavelet-based transform coding of function classes obeys the same types of estimates as theoretical wavelet-based transform coding.
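To make the consequence concrete, here is a hedged sketch of digital nonlinear approximation: expand a sampled object containing a jump in an orthonormal Haar-type basis, keep only the k largest coefficients, and observe the error; the signal, the basis, and the values of k are arbitrary illustrative choices.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar-type analysis matrix for n = 2^J, built by cascading
    the two-point butterfly; its rows form an orthonormal multiscale basis."""
    H = np.eye(1)
    while H.shape[0] < n:
        m = H.shape[0]
        low  = np.kron(H, [1, 1]) / np.sqrt(2)
        high = np.kron(np.eye(m), [1, -1]) / np.sqrt(2)
        H = np.vstack([low, high])
    return H

n = 256
t = np.arange(n) / n
x = np.sin(2 * np.pi * t) + (t > 0.37)    # smooth piece plus a jump
H = haar_matrix(n)
c = H @ x                                 # digital wavelet coefficients

for k in (4, 16, 64):
    keep = np.argsort(np.abs(c))[-k:]     # the k biggest coefficients
    ck = np.zeros(n)
    ck[keep] = c[keep]
    err = np.linalg.norm(x - H.T @ ck)    # l2 error of the k-term approximant
    print(k, err)                         # the error falls quickly with k
```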

The "appropriate finite-length filters" referred to in the last paragraph are in fact arrived at by a delicate process of design. In [20], a collection of finite-length filters was constructed that gave orthonormal digital wavelet transforms.

For the constructed transforms, if one views the digital samples as occurring at points i/n for i = 0, ..., n − 1, and takes individual digital basis elements corresponding to the "same" location and scale at different dyadic n, appropriately normalized and interpolated to create functions of a continuous variable, one obtains a sequence of functions which tends to a limit. Moreover, the limit must be a translate and dilate of a single smooth function of compact support. In short, the digital wavelet transform is truly a digital realization of the theoretical wavelet transform. The norm equivalence statement of the previous paragraph is a way of mathematically completing this fundamental insight. In the last ten years, a large number of interesting constructions of "appropriate finite-length filters" have appeared, which we cannot summarize here. For more complete information on the properties of wavelet transforms and the associated filters see, for example, [100] and [23].

The success in developing ways to translate theoretical wavelet transforms and Gabor transforms into computationally effective methods naturally breeds the ambition to do the same in other cases. Consider the dyadic Heisenberg cells of Section XVIII-E, and the resulting concept of nonuniform tiling of the time-frequency plane. It turns out that for a large collection of nonuniform tilings, one can realize—in a computationally effective manner—a corresponding orthonormal basis [18]; see Fig. 9.

Here are two examples:

• Cosine Packets: Take a recursive dyadic partition of the time interval into dyadic segments. Apply smooth orthonormal partitioning to separate out the original signal into a collection of corresponding segments. Take appropriate finite Fourier transforms of each segment. The result is an orthogonal transform which is associated with a specific tiling of the time-frequency domain. That tiling has, for its projection on the time axis, simply the original partition of the time domain; over the whole time-frequency domain, it consists of columns whose widths are defined by intervals of the time partition; within a column, the tiling simply involves congruent dyadic Heisenberg cells with the specified width. A minimal sketch follows this list.

• Wavelet Packets: Take a recursive dyadic partition of the frequency interval into dyadic segments. Apply subband partitioning to separate out the original signal into a collection of corresponding frequency bands. The result is an orthogonal transform which costs at most order n log(n) operations. It is associated with a specific tiling of the time-frequency domain. That tiling has, for its projection on the frequency axis, simply the original partition of the frequency domain; over the whole time-frequency domain, it consists of rows whose heights are defined by intervals of the frequency partition; within a row, the tiling simply involves congruent dyadic Heisenberg cells with the specified height.
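Here is a minimal sketch of the cosine packet construction, with the smooth orthonormal windowing replaced by brutal segmentation, so this is only the skeleton of the transform, not a faithful digital realization; the partition chosen is arbitrary.

```python
import numpy as np
from scipy.fft import dct

def cosine_packet(x, partition):
    """Orthogonal transform attached to a dyadic partition of the time axis:
    an orthonormal DCT on each segment."""
    out, pos = [], 0
    for length in partition:
        out.append(dct(x[pos:pos + length], type=2, norm='ortho'))
        pos += length
    return np.concatenate(out)

x = np.random.default_rng(5).standard_normal(16)
c = cosine_packet(x, [4, 4, 8])                  # a recursive dyadic partition
assert np.isclose(np.sum(x**2), np.sum(c**2))    # orthogonality
```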

Fig. 9. Some time-frequency tilings corresponding to orthonormal bases.

Do these bases correspond to a "correct" digital implementation of the theoretical partitioning? While they are orthogonal, we are not aware of results showing that they obey a wide range of other norm equivalences. It would be interesting to know if they obey sufficiently strong equivalences to imply that transform coding in an adaptively constructed basis can provide near-optimal coding for an interesting class of objects. At the very least, results of this kind would require detailed assumptions, allowing the partitioning to be inhomogeneous in interesting ways and yet not very wild, and also supposing that the details of window lengths in the smooth orthonormal windowing or the filter choice in the subband partitioning are chosen appropriately.

The time-frequency tiling ideas raise interesting possibilities. In effect, the wavelet packet and cosine packet libraries create large libraries of orthogonal bases, all of which have fast algorithms. The Fourier and wavelet bases are just two examples in this library; Gabor/Wilson-like bases provide other examples. These two collections have been studied by Coifman and Wickerhauser [17], who have shown that for certain "additive" objective functions the search through the library of all cosine packet or wavelet packet bases can be performed in order n log(n) operations. This search is dependent on the function to be analyzed, so it is an instance of nonlinear approximation. In the case when this search is done for compression purposes, an operational rate-distortion-based version was presented in [80].
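A hedged sketch of the adapted-basis search: grow the wavelet packet tree by splitting both branches with the elementary two-tap pair, then prune bottom-up, keeping a node whenever its own cost beats the best total cost of its children. Coifman and Wickerhauser use an entropy functional; the additive l1 cost below is a simple stand-in, and the depth and test signal are arbitrary.

```python
import numpy as np

def split(v):
    """One two-channel step with the elementary two-tap filter pair."""
    return (v[0::2] + v[1::2]) / np.sqrt(2), (v[0::2] - v[1::2]) / np.sqrt(2)

def best_basis(v, depth):
    """Return (cost, leaves) for the best basis below this packet node.

    The cost is additive across nodes, so comparing a node against the sum of
    its children's best costs yields the global optimum; each tree level
    touches every sample once, hence O(n log n) work overall.
    """
    cost_here = np.sum(np.abs(v))          # additive surrogate for entropy
    if depth == 0 or v.size == 1:
        return cost_here, [v]
    low, high = split(v)
    cl, bl = best_basis(low, depth - 1)
    ch, bh = best_basis(high, depth - 1)
    if cl + ch < cost_here:
        return cl + ch, bl + bh            # children win: keep splitting
    return cost_here, [v]                  # this node wins: stop here

x = np.cos(2 * np.pi * 5 * np.arange(64) / 64)   # an oscillatory test signal
cost, leaves = best_basis(x, 4)
print(cost, [leaf.size for leaf in leaves])      # the chosen dyadic partition
```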

XX. PRACTICAL CODING

How do the ideas from harmonic analysis work on real data?

Actually, many of these ideas have been in use for some time in the practical data compression community, though discovered independently and studied under other names.

This is an example of a curious phenomenon commented on by Yves Meyer [73]: some of the really important ideas of harmonic analysis have been discovered, in some cognate or approximate form, in a wide range of practical settings, ranging from signal processing to mathematical physics to computer-aided design. Often the basic tools are the same, but harmonic analysis imposes a different set of requirements on those tools.

To emphasize the relationship of the theory of this paper to practical coders, we briefly comment on some developments.

DCT coding of the type used in JPEG is based on a partitioning of the image domain into blocks, followed by DCT transform coding. The partitioning is done brutally, which produces the obvious drawback of DCT coding: the blocking effect at high compression ratios. Subband coding of images was proposed in the early 1980's [98], [102]. Instead of using rectangular windows as the DCT does, the subband approach effectively uses longer, smooth windows, and thus achieves a smooth reconstruction even at low bit rates. Note that early subband coding schemes were trying to approximate decorrelating transforms like the KLT, while avoiding some of the pitfalls of the DCT. Coding gains of filter banks were used as a performance measure. Underlying such investigations was a Gaussian model for signals, or a Markov random field (MRF) model for images.
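For orientation, a minimal sketch of the block-DCT mechanism; the tile size, test image, and quantizer step are arbitrary, and actual JPEG adds per-frequency quantization tables, zigzag scanning, and entropy coding.

```python
import numpy as np
from scipy.fft import dctn, idctn

img = np.random.default_rng(6).standard_normal((16, 16))

# Brutal partition of the image domain into 8x8 blocks...
blocks = img.reshape(2, 8, 2, 8).swapaxes(1, 2)

# ...followed by an orthonormal 2-D DCT and uniform quantization per block.
coef = dctn(blocks, type=2, norm='ortho', axes=(-2, -1))
step = 0.5
rec_blocks = idctn(np.round(coef / step) * step, type=2, norm='ortho',
                   axes=(-2, -1))

# Each block is coded independently: the discontinuities this creates at the
# block boundaries are the blocking artifacts seen at high compression.
rec = rec_blocks.swapaxes(1, 2).reshape(16, 16)
print(np.linalg.norm(img - rec))
```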

Among possible subband coding schemes for images, a specific structure enjoyed particular popularity, namely, the scheme where the decomposition of the lower frequency band is iterated. This was due to several factors:

1) Its relationship with pyramid coding.


2) The fact that most of the energy remains concentrated in the lowpass version.

3) Its computational efficiency and simplicity.

Of course, this computational structure is equivalent to the discrete wavelet transform, and, as we have described, with appropriately designed digital filters, it can be related to the continuous wavelet series expansion.

Because of the concentrated efforts on subband coding schemes related to discrete wavelet transforms, interesting advances were achieved, leading to improved image compression. The key insight derived from thinking in scale-location terms and from realizing that edges cause an interesting clustering effect across scales: the positions of "big" coefficients remain localized in the various frequency bands, and can therefore be efficiently indexed using a linking across frequencies [68]. This idea was perfected by J. Shapiro into a data structure called an embedded zerotree [86]. Together with successive approximation quantization, this coder achieves high-quality, successive approximation compression over a large range of bit rates, outperforming JPEG at any given bit rate by several decibels in signal-to-noise ratio. Many generalizations of Shapiro's algorithm have been proposed, and we will briefly outline the basic scheme to indicate its relation to nonlinear approximation in wavelet decompositions.

A key idea is that the significant wavelet coefficients are well-localized around points of discontinuity at small scales (or high frequencies); see Fig. 10(a). This is unlike local cosine or DCT bases. Therefore, an edge of a given orientation will appear roughly as an edge in the respective orientation subband, and this at all scales. Conversely, smooth areas will be nulled out in passbands, since the wavelet typically has several vanishing moments. Therefore, a conditional entropy coder can take advantage of this "dependence across scales." In particular, the zerotree structure gathers regions of low energy across scales, by simply predicting that if a passband is zero at a given scale, its four children at the next finer scale (and similar orientation) are most likely to be zero as well (see Fig. 10(b)). This scheme can be iterated across scales. In some sense, it is a generalization of the end-of-block symbol used in DCT coding. (Note that such a prediction across scales would be fruitless in a Gaussian setting, because of the independence of different bands.) Also, note that the actual values of the coefficients are not predicted, but only their "insignificance" or absence of energy; i.e., the idea is one of positional coding. Usually, the method is combined with successive approximation quantization, leading to an embedded bit stream, where successive approximation decoding is possible. An improved version, where larger blocks of samples are used together in a technique called set partitioning, led to the SPIHT algorithm [82], with improved performance and lower computational complexity. Other variations include context modeling in the various bands [65]. Comparing these ideas with the present paper, we see a great commonality of approach in the use of trees for positional coding of the "big" wavelet coefficients, as we have discussed in Section XVII-B.
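A toy one-dimensional rendering of the significance-map idea follows. Shapiro's coder works on the 2-D quadtree, with four children per coefficient, and adds successive approximation; here each parent has two children and a single threshold is used, so the sketch shows only the positional-coding mechanism.

```python
import numpy as np

def significance_map(levels, T):
    """Emit SIG / IZ (isolated zero) / ZTR (zerotree root) symbols.

    levels[j][k] is the coefficient at scale j, position k; its children are
    levels[j+1][2k] and levels[j+1][2k+1]. A ZTR codes the insignificance of
    an entire descendant tree with a single symbol.
    """
    def tree_insignificant(j, k):
        if abs(levels[j][k]) >= T:
            return False
        if j + 1 == len(levels):
            return True
        return all(tree_insignificant(j + 1, c) for c in (2 * k, 2 * k + 1))

    symbols = []

    def visit(j, k):
        if abs(levels[j][k]) >= T:
            symbols.append(("SIG", j, k))
        elif tree_insignificant(j, k):
            symbols.append(("ZTR", j, k))
            return                         # descendants are not visited at all
        else:
            symbols.append(("IZ", j, k))
        if j + 1 < len(levels):
            for c in (2 * k, 2 * k + 1):
                visit(j + 1, c)

    for k in range(len(levels[0])):
        visit(0, k)
    return symbols

levels = [np.array([9.0, 0.2]),
          np.array([4.0, 0.1, 0.05, 0.0]),
          np.array([1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])]
print(significance_map(levels, T=0.5))     # 14 nodes, only 6 symbols emitted
```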

Fig. 10. Localization of the wavelet transform. (a) One-dimensional signal with discontinuity. (b) Two-dimensional signal and linking of scale-space points used in wavelet image coding (EZW and SPIHT).

At a more abstract level, such an approach to wavelet image coding consists in picking a subset of the largest coefficients of the wavelet transform, and making sure that the cost of addressing these largest coefficients is kept down by a smart conditional entropy code. That is, the localization property of the wavelet transform is used to perform efficient addressing of singularities, while the polynomial reproduction property of scaling functions allows a compact representation of smooth surfaces.

How about the performance of this coder and its related cousins? In Fig. 3, we see that a substantial improvement is achieved over JPEG (the actual coder is from [82]). However, the basic behavior is the same; that is, at fine quantization we find again the typical slope predicted by classical high-rate distortion theory. Instead, we would hope for a decay related to the smoothness class. At low bit rates, a recent analysis by Mallat and Falzon [70] explains the observed behavior by modeling the nonlinear approximation features of such a wavelet coder.
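For reference, the fine-quantization law alluded to here is the classical high-rate behavior of such coders: each additional bit per sample cuts the distortion by a fixed factor, about 6 dB of SNR per bit. In hedged form, with c a constant depending on the source and the quantizer,

\[
D(R) \;\approx\; c\,\sigma^{2}\,2^{-2R}.
\]

The hope expressed in the text is instead a decay whose exponent reflects the smoothness class of the images rather than this universal constant.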

The next-generation image coding standard, JPEG-2000, is considering schemes similar to what was outlined above. That is, it has been proposed to exploit properties of the wavelet transform and of the structure of wavelet coefficients across scales, together with state-of-the-art quantization (using, for example, trellis-coded quantizers) and adaptive entropy coders.


In short, there are a variety of interesting parallels between practical coding work and the work in harmonic analysis. We are aware of other areas where interesting comparisons can be made, for example, in speech coding, but omit a full comparison of literatures for reasons of space.

XXI. SUMMARY AND PROGNOSIS

In composing this survey, we have been inspired by an attractive mixture of ideas. To help the reader, we find it helpful to summarize these ideas, and to make them memorable by associating them with prominent figures of this century.

A. Pure Mathematics Disdaining Usefulness

G. H. Hardy is a good example of this position. In A Mathematician's Apology, he gave a famous evaluation of his life's work:

"I have never done anything 'useful.' No discovery of mine has made, or is likely to make, directly or indirectly, for good or for ill, the least difference to the world."

In the same essay, he argued strenuously against the proposition that pure mathematics could ever have "utility."

The irony is, of course, that, as discussed above, the purely mathematical impulse to understand Hardy spaces gave rise to the first orthonormal basis of smooth wavelets.

In this century, harmonic analysis has followed its own agenda of research, involving among other things the understanding of equivalent norms in function spaces. It has developed its own structures and techniques for addressing problems of its own devising; and if one studies the literature of harmonic analysis, one is struck by the way the problems—e.g., finding equivalent norms for function spaces—seem unrelated to any practical needs in science or engineering—e.g., such spaces do not consist of "objects" which could be understood as modeling a collection of "naturally occurring objects" of "practical interest." Nor should this be surprising; David Hilbert once said that "the essence of mathematics is its freedom," and we suspect he meant "freedom from any demands of the outside world to be other than what its own internal logic demands."

Despite the freedom and in some sense "unreality" of its orientation, the structures and techniques that harmonic analysis has developed in pursuing its agenda have turned out to be significant for practical purposes—in particular, as mentioned above, discrete wavelet transforms are being considered for inclusion in future standards like JPEG-2000.

In this paper we have attempted to "explain" how an abstractly oriented endeavor could end up interacting with practical developments in this way. In a sense, harmonic analysts, by carrying out their discipline's internal agenda, discovered a way to "diagonalize" the structure of certain functional classes outside of the cases where this concept arose. This "diagonalization" carries a data compression interpretation, and leads to algorithmic ideas for dealing with other kinds of data than traditional Gaussian data compression theory allows.

B. Stochastic versus Deterministic Viewpoints

Even if one accepts that pure mathematics can have unexpected outcomes of relevance to practical problems, it may seem unusual that analysis per se—which concerns deterministic objects—could be connected with the compression of real data, which concerns random objects. To symbolize this possibility, we make recourse to one of the great mathematicians of this century, A. N. Kolmogorov. V. M. Tikhomirov, in an appreciation of Kolmogorov, said [93]:

"our vast mathematical world" is itself divided into two parts, as into two kingdoms. Deterministic phenomena are investigated in one part, and random phenomena in the other.

To Kolmogorov fell the lot of being a trailblazer in both kingdoms, a discoverer in their many unexplored regions … he put forth a grandiose programme for a simultaneous and parallel study of the complexity of deterministic phenomena and the uncertainty of random phenomena, and the experience of practically all his creative biography was concentrated in this programme.

From the beginning of this programme, the illusory nature of setting limits between the world of order and the world of chance revealed itself.

The scholarly record makes the same point. In his paper at the 1956 IRE Conference on Information Theory, held at MIT, Kolmogorov [60] published the "reverse water-filling" formula (2.5), (2.6) giving R(D) for Gaussian processes, which as we have seen is the formal basis for transform coding. He also described work with Gel'fand and Yaglom rigorously extending the sense of the mutual information formula (2.4) from finite-dimensional vectors to functional data. These indicate that the functional viewpoint was very much on his mind in connection with the Shannon theory. In that same paper he also chose to mention the ε-entropy concept which he had recently defined [61], and he chose a notation which allowed him to make the point that ε-entropy is formally similar to Shannon's entropy. One gets

the sense that Kolmogorov thought the two theories might be closely related at some level, even though one concerns deterministic objects and the other concerns random objects; see also his return to this theme, in more depth, in the appendix of the monograph [62].

C. Analysts Return to Concrete Arguments

Shannon's key discoveries in lossy data compression are summarized in (2.2)–(2.4). These are highly abstract and in fact can be proved in an abstract setting; compare the abstract alphabet source coding theorem and converse in Berger [5]. In this sense, Shannon was a man of his time. During the 1930's–1950's, abstraction in mathematical science was in full bloom; it was the era that produced the formalist mathematical school of Bourbaki and fields like "abstract harmonic analysis."

Kolmogorov's work in lossy data compression was also a product of this era; the concept of ε-entropy is, to many newcomers' tastes, abstraction itself.


Eventually the pendulum swung back, in a return to more concrete arguments and constructions, as illustrated by the work of the Swedish mathematician Lennart Carleson. In a very distinguished research career spanning the entire period from 1950 to the present, Carleson obtained definitive results of lasting value in harmonic analysis, the best known of which is the almost-everywhere convergence of Fourier series, an issue essentially open since the 19th century, and requiring a proof which has been described by P. Jones as "one of the most complicated to be found in modern analysis" [57]. Throughout his career, Carleson created new concepts and techniques as part of resolving very hard problems, including a variety of dyadic decomposition ideas and the concept of Carleson measures. It is interesting to read Carleson's own words [11]:

There was a period, in the 1940's and 1950's, when classical analysis was considered dead and the hope for the future of analysis was considered to be in the abstract branches, specializing in generalization. As is now apparent, the death of classical analysis was greatly exaggerated, and during the 1960's and 1970's the field has been one of the most successful in all of mathematics. … the reasons for this [include] … the realization that in many problems complications cannot be avoided, and that intricate combinatorial arguments rather than polished theories are in the center.

Carleson "grew up" as a young mathematician in the early 1950's, and so it is natural that he would react against the prevailing belief system at the time of his intellectual formation. That system placed great weight on abstraction and generality; Carleson's work, in contrast, placed heavy emphasis on creating useful tools for certain problems which, by the standards of abstract analysts of the day, were decidedly concrete.

Carleson can be taken as symbolic of the position that a concrete problem, though limited in scope, can be very fertile.

D. The Future?

So far, we have used important personalities to symbolize progress to date. What about the future?

One of the themes of this paper is that harmonic analysts, while knowingly working on hard problems in analysis, and discovering tools to prove fundamental results, have actually been developing tools with a broad range of applications, including data compression. Among harmonic analysts, this position is championed by R. R. Coifman. His early work included the development of atomic decompositions for H^p spaces. Today, his focus is in another direction entirely, as he develops ways to accelerate fundamental mathematical algorithms and to implement new types of image compression. This leads him to reinterpret standard harmonic analysis results in a different light.

For example, consider his attitude toward a milestone of classical analysis: L. Carleson's proof of the almost-everywhere convergence of Fourier series, which is generally thought of as a beautiful and extremely complex pure analysis argument. But apparently Coifman sees here a serious effort

to understand the underlying "combinatorics" of time and frequency bases, a "combinatorics" potentially also useful (say) for "time-frequency-based signal compression."

In another direction, consider the work of P. Jones [56], who established a beautiful result showing that one can approximately measure the length of the traveling-salesman tour of a set of points in the plane by a kind of nonlinear Littlewood–Paley analysis. (This has far-reaching extensions by David and Semmes [24].) Others may see here the beginnings of a theory of quantitative geometric measure theory. Coifman apparently sees a serious effort to understand the underlying combinatorics of curves in the plane (and, in David–Semmes, hypersurfaces in higher dimensional spaces), a "combinatorics" which is potentially also useful (say) for compressing two- and higher dimensional data containing curvilinear structure.

The position underlying these interpretations is exactly the opposite of Hardy's: Hardy believed that his research would not be "useful" because he did not intend it to be; yet it turns out that research in harmonic analysis has been, and may well continue to be, "useful" even when researchers, like Hardy, have no conscious desire to be useful.

How far can the connection of Harmonic Analysis and Data Compression go? We are sure it will be fun to find out.

ACKNOWLEDGMENT

The authors would like to thank R. R. Coifman for exposing us to many iconoclastic statements which were provocative, initially mysterious, and ultimately very fruitful. They would also like to thank the editors of this special issue. Sergio Verdú had the idea for assembling this team for this project; both he and Stephen McLaughlin displayed great patience with us in the stages near completion.

D. L. Donoho would like to thank Miriam Donoho for vigorously supporting the effort to make this paper appear. He would also like to thank Xiaoming Huo for extensive effort in preparing several figures for this paper.

M. Vetterli would like to thank Robert M. Gray, Vivek Goyal, Jelena Kovačević, Jerome Lebrun, Claudio Weidmann, and Bin Yu for interactions and help.

I. Daubechies would like to thank Nela Rybowicz for her extraordinary helpfulness in correcting the page proofs.

REFERENCES

[1] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. Comput., vol. C-23, pp. 88–93, Jan. 1974.

[2] E. W. Aslaksen and J. R. Klauder, "Unitary representations of the affine group," J. Math. Phys., vol. 9, pp. 206–211, 1968.

[3] P. Auscher, G. Weiss, and G. Wickerhauser, "Local sine and cosine bases of Coifman and Meyer and the construction of smooth wavelets," in Wavelets: A Tutorial in Theory and Applications, C. K. Chui, Ed. New York: Academic, 1992.

[4] H. B. Barlow, "Possible principles underlying the transformation of sensory messages," in Sensory Communication, W. A. Rosenblith, Ed. Cambridge, MA: MIT Press, 1961, pp. 217–234.

[5] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression. Englewood Cliffs, NJ: Prentice-Hall, 1971.

[6] L. Birgé, "Approximation dans les espaces métriques et théorie de l'estimation" (in French; "Approximation in metric spaces and the theory of estimation"), Z. Wahrsch. Verw. Gebiete, vol. 65, no. 2, pp. 181–237, 1983.

Page 41: Data Compression and Harmonic Analysis - Information ...ingrid/publications/IEEE_Inf_Th_44_2435.pdfof harmonic analysis. Recent harmonic analysis constructions, such as wavelet transforms

DONOHO et al.: DATA COMPRESSION AND HARMONIC ANALYSIS 2475

[7] L. Birgé and P. Massart, "An adaptive compression algorithm in Besov spaces," Constr. Approx., to be published.

[8] M. S. Birman and M. Z. Solomjak, "Piecewise-polynomial approximations of functions of the classes W^α_p," Mat. Sbornik, vol. 73, pp. 295–317, 1967.

[9] A. P. Calderón, "Intermediate spaces and interpolation, the complex method," Studia Math., vol. 24, pp. 113–190, 1964.

[10] B. Carl and I. Stephani, Entropy, Compactness, and the Approximation of Operators. Cambridge, U.K.: Cambridge Univ. Press, 1990.

[11] L. Carleson, "The work of Charles Fefferman," in Proc. Int. Congr. Mathematics (Helsinki, Finland, 1978). Helsinki, Finland: Acad. Sci. Fennica, 1980, pp. 53–56.

[12] A. Cohen and J. P. D'Ales, "Non-linear approximation of random functions," SIAM J. Appl. Math., vol. 57, pp. 518–540, 1997.

[13] A. Cohen, I. Daubechies, and P. Vial, "Orthonormal wavelets for the interval," Appl. Comput. Harmonic Anal., vol. 1, no. 1, 1993.

[14] A. Cohen, I. Daubechies, O. Guleryuz, and M. Orchard, "On the importance of combining wavelet-based nonlinear approximation in coding strategies," unpublished manuscript, 1997.

[15] A. Cohen, W. Dahmen, I. Daubechies, and R. A. DeVore, "Tree approximation and encoding," preprint, 1998.

[16] R. R. Coifman, "A real-variable characterization of H^p," Studia Math., vol. 51, pp. 269–274, 1974.

[17] R. R. Coifman and M. V. Wickerhauser, "Entropy-based algorithms for best basis selection," IEEE Trans. Inform. Theory, vol. 38, no. 2, pp. 713–718, Mar. 1992.

[18] R. R. Coifman and Y. Meyer, "Remarques sur l'analyse de Fourier à fenêtre," C. R. Acad. Sci. Paris, vol. 312, pp. 259–261, 1991.

[19] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.

[20] I. Daubechies, "Orthonormal bases of compactly supported wavelets," Commun. Pure Appl. Math., vol. 41, pp. 909–996, 1988.

[21] ——, "The wavelet transform, time-frequency localization, and signal analysis," IEEE Trans. Inform. Theory, vol. 36, pp. 961–1005, 1990.

[22] I. Daubechies, S. Jaffard, and J. L. Journé, "A simple Wilson orthonormal basis with exponential decay," SIAM J. Math. Anal., vol. 24, pp. 520–527, 1990.

[23] I. Daubechies, Ten Lectures on Wavelets. Philadelphia, PA: SIAM, 1992.

[24] G. David and S. Semmes, Analysis of and on Uniformly Rectifiable Sets (Amer. Math. Soc. Mathematical Surveys and Monographs, vol. 38), 1993.

[25] K. DeLeeuw, J. P. Kahane, and Y. Katznelson, "Sur les coefficients de Fourier des fonctions continues," C. R. Acad. Sci. Paris, vol. 285, pp. 1001–1003, 1978.

[26] R. A. DeVore, "Nonlinear approximation," Acta Numer., vol. 7, pp. 51–150, 1998.

[27] R. A. DeVore, B. Jawerth, and B. Lucier, "Image compression through wavelet transform coding," IEEE Trans. Inform. Theory, vol. 38, pp. 719–746, 1992.

[28] R. DeVore and X. M. Yu, "Degree of adaptive approximation," Math. Comput., vol. 55, pp. 625–635, 1990.

[29] R. DeVore, B. Jawerth, and V. A. Popov, "Compression of wavelet decompositions," Amer. J. Math., vol. 114, pp. 737–785, 1992.

[30] R. A. DeVore and G. G. Lorentz, Constructive Approximation. New York: Springer, 1993.

[31] D. L. Donoho, "Unconditional bases are optimal bases for data compression and statistical estimation," Appl. Comput. Harmonic Anal., vol. 1, no. 1, pp. 100–105, 1993.

[32] ——, "Unconditional bases and bit-level compression," Appl. Comput. Harmonic Anal., vol. 3, no. 4, pp. 388–392, 1996.

[33] ——, "CART and best-ortho-basis: A connection," Ann. Statist., vol. 25, no. 5, pp. 1870–1911, 1997.

[34] ——, "Counting bits with Kolmogorov and Shannon," manuscript, 1998.

[35] R. M. Dudley, "The sizes of compact subsets of Hilbert spaces and continuity of Gaussian processes," J. Funct. Anal., vol. 1, pp. 290–330, 1967.

[36] D. E. Edmunds and H. Triebel, Function Spaces, Entropy Numbers, and Differential Operators. Cambridge, U.K.: Cambridge Univ. Press, 1996.

[37] D. J. Field, "The analysis of natural images and the response properties of cortical cells," J. Opt. Soc. Amer., 1987.

[38] ——, "What is the goal of sensory coding?" Neural Comput., vol. 6, no. 4, pp. 559–601, 1994.

[39] G. Folland, Harmonic Analysis in Phase Space. Princeton, NJ: Princeton Univ. Press, 1989.

[40] H. G. Feichtinger, "Atomic characterizations of modulation spaces through Gabor-type representations," Rocky Mount. J. Math., vol. 19, pp. 113–126, 1989.

[41] C. Fefferman, "The uncertainty principle," Bull. Amer. Math. Soc., vol. 9, pp. 129–206, 1983.

[42] M. Frazier and B. Jawerth, "The φ-transform and applications to distribution spaces," in Function Spaces and Applications (Lecture Notes in Mathematics, vol. 1302), M. Cwikel et al., Eds. Berlin, Germany: Springer-Verlag, 1988, pp. 223–246.

[43] M. Frazier, B. Jawerth, and G. Weiss, Littlewood–Paley Theory and the Study of Function Spaces (CBMS Reg. Conf. Ser. in Mathematics, no. 79). Amer. Math. Soc., 1991.

[44] D. Gabor, "Theory of communication," J. Inst. Elect. Eng., vol. 93, pp. 429–457, 1946.

[45] J. Garnett, Bounded Analytic Functions. New York: Academic, 1981.

[46] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Boston, MA: Kluwer, 1992.

[47] R. Godement, "Sur les relations d'orthogonalité de V. Bargmann," C. R. Acad. Sci. Paris, vol. 225, pp. 521–523, 657–659, 1947.

[48] R. M. Gray and D. L. Neuhoff, "Quantization," IEEE Trans. Inform. Theory, this issue, pp. 2325–2383.

[49] V. Goyal and M. Vetterli, "Computation-distortion characteristics of block transform coding," in ICASSP-97 (Munich, Germany, Apr. 1997), vol. 4, pp. 2729–2732.

[50] K. Gröchenig and S. Samarah, "Nonlinear approximation with local Fourier bases," unpublished manuscript, 1998.

[51] K. Gröchenig and D. Walnut, "Wilson bases are unconditional bases for modulation spaces," unpublished manuscript, 1998.

[52] A. Grossmann and J. Morlet, "Decomposition of Hardy functions into square-integrable wavelets of constant shape," SIAM J. Math. Anal., vol. 15, pp. 723–736, 1984.

[53] G. H. Hardy, A Mathematician's Apology. Cambridge, U.K.: Cambridge Univ. Press, 1940.

[54] R. Howe, "On the role of the Heisenberg group in harmonic analysis," Bull. Amer. Math. Soc., vol. 3, pp. 821–843, 1980.

[55] J. J. Y. Huang and P. M. Schultheiss, "Block quantization of correlated Gaussian random variables," IEEE Trans. Commun. Syst., vol. CS-11, pp. 289–296, Sept. 1963.

[56] P. Jones, "Rectifiable sets and the travelling salesman problem," Inventiones Mathematicae, vol. 102, pp. 1–15, 1990.

[57] ——, "Lennart Carleson's work in analysis," in Festschrift in Honor of Lennart Carleson and Yngve Domar (Acta Universitatis Upsaliensis, vol. 58). Stockholm, Sweden: Almquist and Wiksell Int., 1994.

[58] J. P. Kahane and P. G. Lemarié-Rieusset, Fourier Series and Wavelets. Luxembourg: Gordon and Breach, 1995.

[59] J. Klauder and B. Skagerstam, Coherent States, Applications in Physics and Mathematical Physics. Singapore: World Scientific, 1985.

[60] A. N. Kolmogorov, "On the Shannon theory of information transmission in the case of continuous signals," IRE Trans. Inform. Theory, vol. IT-2, pp. 102–108, 1956.

[61] ——, "Some fundamental problems in the approximate and exact representation of functions of one or several variables," in Proc. III Math. Congress USSR, vol. 2. Moscow, USSR: MCU Press, 1956, pp. 28–29. Reprinted in Kolmogorov's Selected Works, vol. I.

[62] A. N. Kolmogorov and V. M. Tikhomirov, "ε-entropy and ε-capacity," Usp. Mat. Nauk, vol. 14, pp. 3–86, 1959. (English transl.: Amer. Math. Soc. Transl., ser. 2, vol. 17, pp. 277–364.)

[63] H. P. Kramer and M. V. Mathews, "A linear coding for transmitting a set of correlated signals," IRE Trans. Inform. Theory, vol. IT-2, pp. 41–46, Sept. 1956.

[64] P. G. Lemarié and Y. Meyer, "Ondelettes et bases hilbertiennes," Revista Mat. Iberoamericana, vol. 2, pp. 1–18, 1986.

[65] S. LoPresto, K. Ramchandran, and M. T. Orchard, "Image coding based on mixture modeling of wavelet coefficients and a fast estimation-quantization framework," in Data Compression Conf. '97 (Snowbird, UT, 1997), pp. 221–230.

[66] G. G. Lorentz, "Metric entropy and approximation," Bull. Amer. Math. Soc., vol. 72, pp. 903–937, Nov. 1966.

[67] L. Le Cam, "Convergence of estimates under dimensionality restrictions," Ann. Statist., vol. 1, pp. 38–53, 1973.

[68] A. S. Lewis and G. Knowles, "Image compression using the 2-D wavelet transform," IEEE Trans. Image Processing, vol. 1, pp. 244–250, Apr. 1992.

[69] S. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. Pattern Anal. Machine Intell., vol. 11, pp. 674–693, July 1989.

[70] S. Mallat and F. Falzon, "Analysis of low bit rate image transform coding," IEEE Trans. Signal Processing, vol. 46, pp. 1027–1042, Apr. 1998.

[71] H. S. Malvar and D. H. Staelin, "The LOT: Transform coding without blocking effects," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, no. 4, pp. 553–559, 1989.


[72] H. S. Malvar, "Extended lapped transforms: Properties, applications, and fast algorithms," IEEE Trans. Signal Processing, vol. 40, pp. 2703–2714, 1992.

[73] Y. Meyer, "Review of 'An Introduction to Wavelets' and 'Ten Lectures on Wavelets'," Bull. Amer. Math. Soc., vol. 28, pp. 350–359, 1993.

[74] ——, Ondelettes et Opérateurs. Paris, France: Hermann, 1990.

[75] ——, "Wavelets and applications," lecture at the CIRM Luminy meeting, Luminy, France, Mar. 1992.

[76] F. Mintzer, "Filters for distortion-free two-band multirate filter banks," IEEE Trans. Acoust., Speech, Signal Processing, vol. 33, pp. 626–630, June 1985.

[77] A. Pinkus, n-Widths in Approximation Theory. New York: Springer, 1983.

[78] G. Pisier, The Volume of Convex Bodies and Banach Space Geometry. Cambridge, U.K.: Cambridge Univ. Press, 1989.

[79] J. O. Ramsay and B. W. Silverman, Functional Data Analysis. New York: Springer, 1997.

[80] K. Ramchandran and M. Vetterli, "Best wavelet packet bases in a rate-distortion sense," IEEE Trans. Image Processing, vol. 2, pp. 160–175, Apr. 1993.

[81] D. L. Ruderman, "Origins of scaling in natural images," Vision Res., vol. 37, no. 23, pp. 3385–3398, Dec. 1997.

[82] A. Said and W. A. Pearlman, "A new, fast, and efficient image codec based on set partitioning in hierarchical trees," IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp. 243–250, June 1996.

[83] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379–423, 623–656, 1948.

[84] ——, "Communication in the presence of noise," Proc. IRE, vol. 37, pp. 10–21, 1949.

[85] ——, "The bandwagon" (1956), in Claude Elwood Shannon: Collected Papers, N. J. A. Sloane and A. D. Wyner, Eds. Piscataway, NJ: IEEE Press, 1993.

[86] J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans. Signal Processing, vol. 41, pp. 3445–3462, Dec. 1993.

[87] E. Simoncelli, "Statistical models for images: Compression, restoration and synthesis," presented at the IEEE 31st Asilomar Conf. Signals, Systems, and Computers, Pacific Grove, CA, 1997.

[88] M. J. T. Smith and T. P. Barnwell, III, "Exact reconstruction for tree-structured subband coders," IEEE Trans. Acoust., Speech, Signal Processing, vol. 35, pp. 431–441, 1986.

[89] E. Stein, Singular Integrals and Differentiability Properties of Functions. Princeton, NJ: Princeton Univ. Press, 1970.

[90] ——, Harmonic Analysis: Real Variable Methods, Orthogonality, and Oscillatory Integrals. Princeton, NJ: Princeton Univ. Press, 1993.

[91] J. O. Strömberg, in Festschrift in Honor of Antoni Zygmund. Monterey, CA: Wadsworth, 1982.

[92] ——, "Maximal functions, Hardy spaces, BMO, and wavelets, from my point of view," in Festschrift in Honor of Lennart Carleson and Yngve Domar (Acta Universitatis Upsaliensis, vol. 58). Stockholm, Sweden: Almquist and Wiksell Int., 1994.

[93] V. M. Tikhomirov, "Widths and entropy," Usp. Mat. Nauk, vol. 38, pp. 91–99, 1983 (in Russian); English transl. in Russian Math. Surveys, vol. 38, pp. 101–111.

[94] ——, "Commentary: ε-entropy and ε-capacity," in A. N. Kolmogorov: Selected Works. III. Information Theory and Theory of Algorithms, A. N. Shiryaev, Ed. Boston, MA: Kluwer, 1992.

[95] B. Torresani, "Time-frequency representations: Wavelet packets and optimal decomposition," Ann. l'Institut Henri Poincaré (Physique Théorique), vol. 56, no. 2, pp. 215–234, 1992.

[96] H. Triebel, Theory of Function Spaces. Basel, Switzerland: Birkhäuser, 1983.

[97] P. P. Vaidyanathan, "Quadrature mirror filter banks, M-band extensions and perfect reconstruction techniques," IEEE ASSP Mag., vol. 4, pp. 4–20, July 1987.

[98] M. Vetterli, "Multi-dimensional subband coding: Some theory and algorithms," Signal Processing, vol. 6, no. 2, pp. 97–112, Apr. 1984.

[99] ——, "Filter banks allowing perfect reconstruction," Signal Processing, vol. 10, no. 3, pp. 219–244, Apr. 1986.

[100] M. Vetterli and J. Kovačević, Wavelets and Subband Coding. Englewood Cliffs, NJ: Prentice-Hall, 1995.

[101] P. H. Westerink, J. Biemond, D. E. Boekee, and J. W. Woods, "Subband coding of images using vector quantization," IEEE Trans. Commun., vol. 36, pp. 713–719, June 1988.

[102] J. W. Woods and S. D. O'Neil, "Sub-band coding of images," IEEE Trans. Acoust., Speech, Signal Processing, vol. 34, pp. 1278–1288, Oct. 1986.

[103] A. Zygmund, Trigonometric Series, vols. I & II. Cambridge, U.K.: Cambridge Univ. Press, 1959.