
NONLINEAR MULTISCALE METHODS FOR ESTIMATION,

APPROXIMATION, AND REPRESENTATION OF SIGNALS AND IMAGES

A Thesis

Submitted to the Faculty

of

Purdue University

by

Yan Huang

In Partial Fulfillment of the

Requirements for the Degree

of

Doctor of Philosophy

December 2004


To my parents, Menglin Huang and Kangying Li.


ACKNOWLEDGMENTS

First of all, I wish to express my deep gratitude to my advisor, Professor Ilya

Pollak, for his dedicated guidance, enlightening suggestions and enthusiastic discus-

sions. I was very fortunate to have the opportunity to work with him. I am also

thankful for his offering of a research assistantship, without which I could not have

the opportunity to study at Purdue at all.

I would like to thank my other committee members, Professor Jan P. Allebach,

Professor Charles A. Bouman, and Professor Bradley J. Lucier for their suggestions

and support.

I am thankful for the help and friendship of my officemates, Xiaogang Dong and

Huaili (Wiley) Wang, and my former officemate, Seungseok Oh.

I am grateful to my friends Liyun Wang, Guotong Feng, Feng Lu, Zhen He, and

Maggie for their help during my first semester at Purdue, when I needed help most.

I would also like to thank Yuxin (Zoe) Liu, Hui Peng, Hai Li, Yiran Chen, Rongmei
Zhang, Qingzhou Wang, Yu Ying, Le Cai, Xiaojun Feng, Longbi Lin, Zhen Li, Limin
Liu, Zhi Jiang, Li Xu, Buyue Zhang, Mei Wang, Yue Lei, Wenti (Wendy) Xu, and

Peng Cheng, and many other friends. They made my life at Purdue an enjoyable

experience and their sincere friendship was invaluable to me.

I am most grateful to my parents. They have always loved and supported me. I
know they are happiest and most proud of me as I receive this degree, so I
dedicate this thesis to them.

Special thanks to my daughter, Jia Erin Ni (now one month old), for the great

joy and happiness she brought to me.

Last but not least, I thank my dear husband, Bin Ni, for his love, understanding

and support. Without him, I could not have completed this thesis.


This work was supported in part by a National Science Foundation (NSF) CAREER award CCR-0093105, an NSF grant IIS-0329156, and a Purdue Research Foundation grant 6903570.


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

1 ESTIMATION PROBLEMS WITH NONLINEAR EVOLUTION EQUATIONS
   1.1 Introduction
   1.2 Experimental Results
      1.2.1 Comparisons to the Original Method of Rudin-Osher-Fatemi in 1-D
      1.2.2 Experiments in 2-D
   1.3 Background, Notation, and Definitions
      1.3.1 Nonlinear Diffusions
      1.3.2 Notation in 1-D
      1.3.3 A SIDE in 1-D
   1.4 Basic Properties of the 1-D SIDE
   1.5 Optimal Estimation in 1-D
      1.5.1 ML Estimation with a TV Constraint
      1.5.2 The Bouman-Sauer Problem
      1.5.3 The Rudin-Osher-Fatemi Problem
      1.5.4 Adaptive Stopping
   1.6 Conclusions
2 BEST BASIS SEARCH IN LAPPED DICTIONARIES
   2.1 Introduction
   2.2 Local Cosine Decompositions
      2.2.1 Best Basis Search Problem
      2.2.2 A Local Cosine Dictionary
      2.2.3 A Best Basis Algorithm
      2.2.4 Shift-Invariance: A Qualitative Discussion
      2.2.5 A Strictly Shift-Invariant Algorithm
      2.2.6 Examples with the Entropy Cost
      2.2.7 Frequency-Domain Local Cosines
      2.2.8 Noise Removal Examples
   2.3 Further Extensions of the Basic Algorithm
      2.3.1 Extension 1, Min-M: Allowing Arbitrary Positions for Windows
      2.3.2 Extension 2: Blocks Algorithm
      2.3.3 Extension 3: Overlapping-Blocks Algorithm
   2.4 Best Basis Search in Lapped Dictionaries
   2.5 Conclusions
3 FAST SEARCH FOR BEST REPRESENTATIONS IN MULTITREE DICTIONARIES
   3.1 Introduction
   3.2 Example 1: Optimal Rectangular Tilings
      3.2.1 A Fast Recursive Tiling Algorithm
      3.2.2 A Simple Cost Function
      3.2.3 Reducing the Computational Complexity
   3.3 Example 2: Optimal Wedgelet Tilings
      3.3.1 Algorithm Extension 1: State Variables
      3.3.2 Wedgelet Experiments
   3.4 Further Extensions of the Optimal Tiling Algorithm
      3.4.1 Algorithm Extension 2: Incorporating Internal Nodes into the Cost
      3.4.2 Algorithm Extension 3: Dynamic Programming Over a Sequence of Blocks
   3.5 Example 3: Multitree Image Coding Algorithm
      3.5.1 Compression Experiments
   3.6 Multitree Dictionaries
   3.7 Relationships with Prior Work
   3.8 Conclusions
LIST OF REFERENCES
APPENDICES
   Appendix A: Proof of Proposition 4
      A.1 Proof of Four Auxiliary Lemmas
      A.2 Proof of Proposition 4
   Appendix B: A Suboptimal Strictly Shift-Invariant Algorithm
   Appendix C: Time Complexity of Various Algorithms
VITA

LIST OF TABLES

1.1  A comparison of computation times for 50 1024-point signals.
1.2  Table of notation for the segmentation parameters of a signal.
2.1  Comparison of dyadic and mod-M method with noise level σ = 150 (SNR = 8.22 dB).
2.2  Running times and costs for dyadic and mod-M algorithms.
2.3  Running times and costs for the (a) blocks algorithm and (b) overlapping-blocks algorithm, each used with the mod-M algorithm.

LIST OF FIGURES

1.1  An experiment in 1-D: Noise removal via constrained total variation minimization using our algorithm and algorithm in [1]. For this 10000-point signal, with comparable reconstruction quality and RMS errors, our algorithm takes 0.29 seconds on a SunBlade 1000 (750 MHz processor), whereas the algorithm of [1], with the most favorable parameter settings, takes 1005 seconds.
1.2  An experiment in 1-D. (a) RMS errors, as a function of noise standard deviation σ, for our basic constrained total variation minimization method which uses the knowledge of σ (solid line) and for a variant of the method which does not rely on knowing σ but uses an adaptive stopping rule instead (dotted line). (b) A piecewise constant signal with additive white Gaussian noise of σ = 30. (c) Restoration using our constrained total variation minimization method with known σ (solid line) superimposed onto the noise-free signal (dotted line). (d) Restoration using our adaptive stopping rule.
1.3  An experiment in 2-D: Noise removal via constrained total variation minimization using our algorithm and algorithm in [1]. With comparable reconstruction quality and RMS errors, our algorithm takes 14 seconds (on a SunBlade 1000 750 MHz processor), whereas the algorithm of [1] takes 151 seconds.
1.4  Another 2-D experiment. With comparable reconstruction quality and RMS errors, our algorithm takes 14 seconds (on a SunBlade 1000 750 MHz processor), whereas the algorithm of [1] takes 54 seconds.
1.5  Zooming in on the tank experiment: (a) two patches; (b-e) enlarged versions of the images in Fig. 1.3 for the first patch; (f-i) enlarged versions of the images in Fig. 1.3 for the second patch.
1.6  An illustration of the evolution of SIDE: (a) a mesh plot of the solution u(t) as a function of time t and index n, with hitting times t1 = 3.2, t2 = 6.54, t3 = 9.21, t4 = tf = 15.53; (b) plots of the initial signal (top) and the solution at the four hitting times.
1.7  Left: The total variation of the solution, as a function of time, for the evolution of Fig. 1.6. Right: Its time derivative.
2.1  A window function βu,v and an element of a local cosine dictionary.
2.2  Pseudocode specification of a fast dynamic programming algorithm for the best local cosine basis search. The cost of the best basis Ou,N is denoted by C∗u,N.
2.3  The original signals and time-frequency representations of the best local cosine basis with smallest cell size M = 16: (a) a signal consisting of two local cosine basis functions; (b) the time-frequency tiling for the best local cosine basis of [2, 3]; (c) the time-frequency tiling for the shift-invariant local cosine decomposition [4]; (d) the time-frequency tiling for the best mod-M local cosine basis; (e-h) a similar experiment for the signal in (a) shifted by 16 samples; (i-l) a similar experiment for a signal where the two local cosine bumps are shifted by different amounts. The darker the rectangle in (b-d,f-h,j-l), the larger the amplitude of the corresponding local cosine coefficient.
2.4  Two signals and the time-frequency pictures of their best bases: (a) segment “grea” of the speech signal “greasy”; (b,c) the time-frequency tilings for the best local cosine basis of [2] and for the best mod-M local cosine basis, respectively; (d-f) a similar experiment for a shorter segment of the speech signal.
2.5  The performance of four algorithms for extracting the best local cosine basis: dyadic [2] (dotted), shift-invariant LCD [4] (dashdot), the proposed mod-M algorithm (solid), and the proposed shift-invariant version of the mod-M algorithm (dashed). The optimal cost is depicted as a function of the minimal allowed cell size: (a) 4096-point “grea” speech signal, (b) 512-point segment of the signal.
2.6  Top row: two local cosine functions. Bottom row: two functions from the frequency-domain local cosine dictionary obtained by taking the inverse DCT-IV of the functions in the top row.
2.7  (a) A noisy speech signal; (b) its DCT; (c) noise-free speech signal; (d) its DCT; (e) the basis vector from the best mod-M local cosine basis whose inner product with the signal in (a) is the largest; (f) its DCT; (g) the basis vector from the best mod-M FDLC basis whose inner product with the signal in (a) is the largest; (h) its DCT.
2.8  Best basis thresholding with dyadic and mod-M local cosine dictionaries in time domain and in frequency domain. The second row shows various estimates of the signal (a) based on its noisy observation (c), and the third row shows the corresponding tilings of the time-frequency plane. From left to right: (e,i) dyadic local cosine dictionary; (f,j) mod-M local cosine dictionary; (g,k) dyadic frequency-domain local cosine dictionary; (h,l) mod-M frequency-domain local cosine dictionary.
2.9  Pseudocode specification of a fast dynamic programming algorithm for the best-basis search in a lapped dictionary.
3.1  An illustration of tilings and sequences of splits. (a) An admissible tiling, i.e., a tiling that can be obtained via a sequence of binary splits. (b) An inadmissible tiling. (c) A sequence of splits that leads to the tiling in (a). (d) Another sequence of splits that leads to the tiling in (a).
3.2  Pseudocode specification of a fast recursive search for the best rectangular tiling: (a) the recursive calculation of the optimal left children s∗P and the corresponding costs C∗P; (b) the recursive generation of the best tiling. It is assumed that both routines have access to the same global data structure Table. The optimal tiling B∗Q of an image domain Q is obtained with (C∗Q, s∗Q) ← best split v0(Q), followed by B∗Q ← best tiling v0(Q).
3.3  A 256 × 256 cameraman image and its best rectangular tilings with the smallest cell size 16 × 16: (b) best dyadic tiling, cost 0.57; (c) best arbitrary tiling, cost 0.44.
3.4  A wedgelet.
3.5  Two best wedgelet tiling examples for an 128 × 128 binary image: (a) quadtree wedgelets, SNR = 17.1 dB, rate = 0.0062 bits per pixel; (b) dyadic wedgelets, SNR = 17.8 dB at 0.0055 bits per pixel. Panel (c) shows the rate-distortion curves for this image, for the quadtree wedgelets (dashed) and the dyadic wedgelets (solid).
3.6  Pseudocode for the recursive calculation of the optimal splits and states and the corresponding costs for cost2 of Section 3.4.1.
3.7  Pseudocode for the recursive generation of the best tree for Section 3.4.1.
3.8  Pseudocode for the dynamic programming over blocks, Section 3.4.2.
3.9  Rate-Distortion curves for “goldhill” (top left), “barbara” (top right), “lenna” (bottom left), and “cameraman” (bottom right).
3.10 Results for the “barbara” image at PSNR = 36.4 dB: (a) original image, (b) JPEG (rate = 1.31 bits per pixel), (c) quadtree compression (rate = 1.00 bits per pixel), and (d) multitree compression (rate = 0.83 bits per pixel).
3.11 Results for the “barbara” image at the bit rate of 0.49 bits per pixel: (a) a patch of the original image, (b) JPEG (PSNR for the overall image = 28.3 dB), (c) quadtree compression (PSNR = 30.5 dB), and (d) multitree compression (PSNR = 31.9 dB).
3.12 Rate-Distortion curves for “goldhill” (top row), “barbara” (second row), “lenna” (third row), and “cameraman” (bottom row). The right column shows bit rates as percentages of the bit rate for the multitree algorithm with arithmetic coding of the coefficients.
3.13 Pseudocode for the recursive calculation of the best splits and best costs, and for the recursive generation of the globally optimal tree.


ABSTRACT

Huang, Yan. Ph.D., Purdue University, December 2004. Nonlinear Multiscale Methods for Estimation, Approximation, and Representation of Signals and Images. Major Professor: Ilya Pollak.

We cover two topics in the broad area of nonlinear multiscale methods. In the first

topic, we develop computationally efficient procedures for solving certain restoration

problems in 1-D, including the discrete versions of the total variation regularized

problem and the constrained total variation minimization problem. They are based

on a simple nonlinear diffusion equation and related to the Perona-Malik equation. A

probabilistic interpretation for this diffusion equation in 1-D is provided by showing

that it produces optimal solutions to a sequence of estimation problems. We extend

our methods to 2-D where they no longer have similar optimality properties; however,

we experimentally demonstrate their effectiveness for image restoration.

In the second topic we introduce a new framework of multitree dictionaries and

propose new algorithms for efficiently finding the best representation in a multitree

dictionary. We apply our framework to develop novel dynamic programming algo-

rithms for finding the best basis in a dictionary of arbitrary lapped bases in 1-D. We

illustrate this using a non-dyadic local cosine dictionary, and show that the resulting

representations are more compact and are characterized by lower costs and approxi-

mate shift-invariance. We also provide an algorithm which is strictly shift-invariant

and several accelerated versions of the basic algorithm which explore various trade-

offs between computational efficiency and adaptability. A novel dictionary which

constructs the best local cosine representation in the frequency domain is proposed

and shown to be better suited for representing certain types of signals. We apply our

framework in 2-D to develop novel tree-pruning algorithms for finding the best basis


in an arbitrary multitree dictionary. We illustrate our framework through several

examples, including a novel block image coder which significantly outperforms both

the standard JPEG and quadtree-based methods, and is comparable to embedded

coders such as JPEG2000 and SPIHT.


1. ESTIMATION PROBLEMS WITH NONLINEAR

EVOLUTION EQUATIONS

1.1 Introduction

Recent years have seen a great number of exciting developments in the field of

image reconstruction, restoration, and segmentation via nonlinear diffusion filtering

(see, e.g., a survey article [5]). Since the objective of these filtering techniques is

usually extraction of information in the presence of noise, their probabilistic inter-

pretation is important. In particular, a natural question to consider is whether or

not these methods solve standard estimation or detection problems. An affirmative

answer would help us understand which technique is suited best for a particular ap-

plication, and aid in designing new algorithms. It would also put the tools of the

classical detection and estimation theory at our disposal for the analysis of these

techniques, making it easier to tackle an even more crucial issue of characterizing

the performance of a nonlinear diffusion technique given a probabilistic noise model.

Addressing the relationship between nonlinear diffusion filtering and optimal es-

timation is, however, very difficult, because the complex nature of the nonlinear

partial differential equations (PDEs) used in these techniques and of the images of

interest makes this analysis prohibitively complicated. Some examples of the exist-

ing literature on the subject are [6, 7] which establish qualitative relations between

the Perona-Malik equation [8,9] and gradient descent procedures for estimating ran-

dom fields modeled by Gibbs distributions. Bayesian ideas are combined in [10]

with snakes and region growing for image segmentation. In [11], concepts from ro-

bust statistics are used to modify the Perona-Malik equation. In [12], a connection

between random walks and diffusions is used to obtain a new evolution equation.


One of the goals of this chapter is to move forward the discussion of this issue by

establishing that a particular nonlinear diffusion filtering method results in a maxi-

mum likelihood (ML) estimate for a certain class of signals. We expand and develop

the contributions of [13, 14], obtaining new methods for solving certain restoration

problems. The methods are first developed in 1-D, where they are provably fast, and

provably optimal. While we do not have analytical results on the 2-D generalizations

of our methods, experiments show that the 2-D algorithms are efficient and robust,

as well.

We concentrate most of our attention on the problem of maximum likelihood

estimation in additive white Gaussian noise, subject to a constraint on the total

variation (TV). We show that this problem, in 1-D, is closely related to the total

variation minimization problems posed by Bouman and Sauer in [15, 16] and by

Rudin, Osher, and Fatemi in [1]. This choice of our prototypical problem is motivated

by a great success of total variation minimization methods [1, 15–19], which has

demonstrated a critical need for fast computational techniques [20–24]. A major

contribution of this chapter is a new fast and exact algorithm for solving the 1-D

discrete-time versions of these problems, as well as a fast and approximate algorithm

for the 2-D versions.

In order to motivate our theoretical results and the ensuing algorithms, we start

in Section 1.2 by experimentally demonstrating the effectiveness of the methods

proposed in this chapter. We proceed with a review of background material on

nonlinear diffusions in Section 1.3, and then focus on one very simple evolution in

Section 1.4. We show in Section 1.5 that this evolution results in a fast solver of our

ML problem. Section 1.6 summarizes the contributions of this chapter.


[Figure 1.1: (a) Noise-free signal. (b) Noisy signal. (c) Our reconstruction. (d) Reconstruction using [1].]

Fig. 1.1. An experiment in 1-D: Noise removal via constrained total variation minimization using our algorithm and algorithm in [1]. For this 10000-point signal, with comparable reconstruction quality and RMS errors, our algorithm takes 0.29 seconds on a SunBlade 1000 (750 MHz processor), whereas the algorithm of [1], with the most favorable parameter settings, takes 1005 seconds.

1.2 Experimental Results

1.2.1 Comparisons to the Original Method of Rudin-Osher-Fatemi in 1-D

The algorithms developed in this chapter address the problem of noise removal

both in 1-D and 2-D. In particular, the 1-D algorithm described below in Subsection 1.5.3 calculates the solution to the constrained total variation minimization

problem originally posed by Rudin, Osher, and Fatemi in [1]. Fig. 1.1 illustrates

the comparison of our method to the original method of [1]. Fig. 1.1(a) shows a

synthetic piecewise constant signal. The same signal, with additive zero-mean white

Gaussian noise of standard deviation σ = 10, is shown in Fig. 1.1(b). This signal

is processed using our algorithm, and the resulting output–obtained in 0.29 seconds

on a SunBlade 1000–is shown in Fig. 1.1(c). Note that the only parameter that our

algorithm needs is σ. In this example, we assume that we know σ exactly.

Since the method of [1] is an iterative descent method, no straightforward compar-

ison is possible. Indeed, in addition to σ, that algorithm needs two other parameters:

the time step and the stopping criterion. Choosing the time step a priori is a diffi-

cult task. Generally, the noisier the data, the larger the number of iterations and the smaller the time steps required to achieve a given precision. To make the comparison as tough for our own method as possible, we chose

the largest time step for the algorithm of [1] that can result in the reconstruction

quality comparable to that of our method. Specifically, the parameters were chosen

to produce an RMS error per sample of 1.2 (as compared to 0.9 for our algorithm).

We also assume here that the correct value of σ is known. The result, for the input

signal of Fig. 1.1(b), is shown in Fig. 1.1(d). It took 1005 seconds to compute the

solution–in excess of three orders of magnitude more computation time than our

method.
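For reference, the reconstruction-quality figures quoted throughout this section can be read as the usual root-mean-square (RMS) error per sample. A minimal sketch of that metric follows; the helper name is ours, not the thesis's.

```python
import numpy as np

def rms_error_per_sample(estimate, truth):
    """RMS error per sample between an estimate and the noise-free signal."""
    estimate, truth = np.asarray(estimate, float), np.asarray(truth, float)
    return float(np.sqrt(np.mean((estimate - truth) ** 2)))
```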

The advantages of our method illustrated with this simple example also apply

when comparing it to other, more recent approaches to solving the TV minimization

problem or related problems (e.g., [1, 15, 17, 20–25]). An important advantage of

our method is that a bound on the overall computational complexity is available.

Thanks to Proposition 7 of Section 1.5 below, our 1-D algorithm is guaranteed to

achieve the solution of the Rudin-Osher-Fatemi problem exactly (modulo computer

precision), in a finite number of steps which is O(N log N).


In addition, our method relies on only one parameter, σ. Moreover, as illustrated

below in Fig. 1.2, the requirement to know σ–which tells our recursive algorithm

when to stop–can be dispensed with and replaced with an adaptive stopping rule,

with essentially no sacrifice in performance.

Table 1.1
A comparison of computation times for 50 1024-point signals.

                     Mean computation time   Standard deviation   Fastest time   Slowest time
Proposed algorithm   0.06                    0.004                0.05           0.07
Algorithm of [1]     6.96                    6.09                 1.11           24.32

To further compare our method with that of [1] in 1-D, we refer to Table 1.1 which

details the results of a Monte-Carlo simulation. Fifty randomly generated piecewise

constant signals of length N = 1024 with additive white Gaussian noise (SNR = 1)

are used in the experiment. For each signal, we run our algorithm and calculate the

RMS error e. We then try the descent algorithm of [1] with several different time

steps, and choose the time step for which the algorithm can reach the RMS error of

1.1e (i.e., be within 10% of our RMS error) with the fewest iterations. (If it cannot

reach 1.1e in 150,000 iterations–this happened in 3 out of 50 cases–we stop.) We

then look at the computation time for the algorithm of [1], with the optimal time

step. The results of this experiment are summarized in Table 1.1, showing once again

that our 1-D method is much faster: the fastest run of the algorithm of [1] is more

than an order of magnitude slower than the slowest run of our algorithm.

We now test an adaptive version of our algorithm, outlined below in Section 1.5.

Our algorithm is a recursive procedure: when the standard deviation σ of the noise is

known, this parameter can be used to stop the recursion. If, however, σ is unknown,

we can apply a different, adaptive, criterion, to determine when to stop. As Fig. 1.2

shows, the RMS errors for the adaptive algorithm are very similar to those for the

case when the correct value of σ is known. Moreover, since we use the same basic


[Figure 1.2: (a) Monte-Carlo simulations (RMS error vs. σ; basic algorithm and adaptive stopping rule). (b) A noisy signal. (c) Restored via basic algorithm. (d) Adaptive stopping rule.]

Fig. 1.2. An experiment in 1-D. (a) RMS errors, as a function of noise standard deviation σ, for our basic constrained total variation minimization method which uses the knowledge of σ (solid line) and for a variant of the method which does not rely on knowing σ but uses an adaptive stopping rule instead (dotted line). (b) A piecewise constant signal with additive white Gaussian noise of σ = 30. (c) Restoration using our constrained total variation minimization method with known σ (solid line) superimposed onto the noise-free signal (dotted line). (d) Restoration using our adaptive stopping rule.


procedure to calculate the solution, the computational complexity of the adaptive

method is the same.

1.2.2 Experiments in 2-D

While our 1-D algorithms extend to 2-D, our main theoretical results do not.

Propositions 5, 6, and 7 of Section 1.5 do not hold in 2-D–i.e. we can no longer claim

that our algorithm exactly solves the respective optimization problems in 2-D. More-

over, it can be shown that the time complexity of the 2-D version of our algorithm

is actually O(N^2 log N), not O(N log N), where N is the total number of pixels in

the image. Our algorithm nevertheless solves these problems approximately and still

has a bound on computational complexity. We have also observed that its actual

time complexity is typically lower than the asymptotic bound of O(N^2 log N). To

illustrate our algorithm, we use two images: one image from [1], shown in Fig. 1.3(a),

and another image shown in Fig. 1.4(a). The two images, corrupted by zero-mean

additive white Gaussian noise are shown in Figs. 1.3(b) and 1.4(b), respectively. The

noise variance for the tank image is chosen so as to achieve the same SNR as in the

original paper [1]. The outputs of our algorithm¹ are in Figs. 1.3(c) and 1.4(c). The

computation time is approximately 14 seconds in both cases. We use the parameter

settings in the algorithm of [1] to achieve a comparable RMS error, in both cases,

and the results are depicted in Figs. 1.3(d) and 1.4(d). The computation times were

151 seconds for the tank image and 54 seconds for the airplane image. To better

illustrate the reconstructions, we provide in Fig. 1.5 enlarged versions of two patches

from the images of Fig. 1.3. Fig. 1.5(a) illustrates where the two patches are taken

from. Fig. 1.5(b-e) is the first patch: original, noisy, restored with our method, and

restored with the method of [1], respectively. Fig. 1.5(f-i) contains similar images

for the second patch. The visual quality of the two reconstructions is similar; both

methods are good at recovering edges and corners.

¹We only show the outputs of the adaptive version of our algorithm since the results when the noise variance is known are similar.


[Figure 1.3: (a) An image. (b) Noisy image. (c) Our reconstruction. (d) Reconstruction using [1].]

Fig. 1.3. An experiment in 2-D: Noise removal via constrained total variation minimization using our algorithm and algorithm in [1]. With comparable reconstruction quality and RMS errors, our algorithm takes 14 seconds (on a SunBlade 1000 750 MHz processor), whereas the algorithm of [1] takes 151 seconds.

1.3 Background, Notation, and Definitions

1.3.1 Nonlinear Diffusions

The basic problem considered in this chapter is restoration of noisy 1-D signals

and 2-D images. We build on the results in [26], where a family of systems of ordinary


[Figure 1.4: (a) An image. (b) Noisy image. (c) Our reconstruction. (d) Reconstruction using [1].]

Fig. 1.4. Another 2-D experiment. With comparable reconstruction quality and RMS errors, our algorithm takes 14 seconds (on a SunBlade 1000 750 MHz processor), whereas the algorithm of [1] takes 54 seconds.

differential equations, called Stabilized Inverse Diffusion Equations (SIDEs), was

proposed for restoration, enhancement, and segmentation of signals and images. The

(discretized) signal or image u0 to be processed is taken to be the initial condition for

the equation, and the solution u(t) of the equation provides a fine-to-coarse family


[Figure 1.5: (a); (b) A patch. (c) Noisy patch. (d) Our reconstruction. (e) Reconstruction using [1]. (f) A patch. (g) Noisy patch. (h) Our reconstruction. (i) Reconstruction using [1].]

Fig. 1.5. Zooming in on the tank experiment: (a) two patches; (b-e) enlarged versions of the images in Fig. 1.3 for the first patch; (f-i) enlarged versions of the images in Fig. 1.3 for the second patch.


of segmentations of the image. This family is indexed by the “scale” (or “time”)

variable t, which assumes values from 0 to ∞.

The usefulness of SIDEs for segmentation was shown in [26]; in particular, it was

experimentally demonstrated that SIDEs are robust to noise outliers and blurring.

They are faster than other image processing algorithms based on evolution equations,

since region merging reduces the dimensionality of the system during evolution.

In [27], SIDEs were successfully incorporated as part of an algorithm for segmenting

dermatoscopic images.

1.3.2 Notation in 1-D

In this subsection, we introduce notation which will be used throughout the

chapter to analyze our 1-D SIDE.

The number of samples in the signals under consideration is always denoted by N. The signals themselves are denoted by boldface lowercase letters and viewed as vectors in R^N. The samples of a signal are denoted by the same letter as the signal itself, but in normal face, with subscripts 1 through N, e.g., u = (u_1, ..., u_N)^T. We say that u_n is the intensity of the n-th sample of u.

A set of the indexes {n, n+1, ..., k} of consecutive samples of a signal u which have equal intensities, u_n = u_{n+1} = ... = u_k, is called a region if this set cannot be enlarged–in other words, if u_{n-1} ≠ u_n (or n = 1) and u_k ≠ u_{k+1} (or k = N). Any pair of consecutive samples with unequal intensities is called an edge. The number of distinct regions in a signal u is denoted by p(u). The indexes of the left endpoints of the regions are denoted by n_i(u), i = 1, ..., p(u) (the n_i's are ordered from left to right, n_1(u) < n_2(u) < ... < n_{p(u)}(u)); the intensity of each sample within region i is denoted by μ_i(u), and is referred to as the intensity of region i. This means that n_1(u) is always 1, and that

\[ \mu_i(\mathbf{u}) = u_{n_i(\mathbf{u})} = u_{n_i(\mathbf{u})+1} = \ldots = u_{n_{i+1}(\mathbf{u})-1}, \qquad \text{for } i = 1, \ldots, p(\mathbf{u}), \]

where we use the convention n_{p(u)+1} = N + 1.


The length m_i(u) of the i-th region of u (i.e., the number of samples in the region) satisfies:

\[ m_i(\mathbf{u}) = n_{i+1}(\mathbf{u}) - n_i(\mathbf{u}). \]

Two regions are called neighbors if they have consecutive indexes. The number of

neighbors of the i-th region of u is denoted by ρi(u), and is equal to one for the

leftmost and rightmost regions, and two for all other regions:

\[ \rho_i(\mathbf{u}) = \begin{cases} 1, & i = 1 \text{ or } i = p(\mathbf{u}), \\ 2, & \text{otherwise.} \end{cases} \]

We call region i a maximum (minimum) of a signal u if its intensity μ_i(u) is larger (smaller) than the intensities of all its neighbors.² Region i is an extremum if it is either a maximum or a minimum. We let

\[ \beta_i(\mathbf{u}) = \begin{cases} 1, & \text{if region } i \text{ is a maximum of } \mathbf{u}, \\ -1, & \text{if region } i \text{ is a minimum of } \mathbf{u}, \\ 0, & \text{otherwise.} \end{cases} \]

The parameter p(u) and the four sets of parameters ni(u), mi(u), ρi(u), and βi(u),

for i = 1, . . . , p(u), are crucial to both the analysis of our 1-D algorithms and the

description of their implementation. Collectively, these parameters will be referred

to as segmentation parameters of signal u. They are summarized in Table 1.2. When

it is clear from the context which signal is being described by these parameters, we

will drop their arguments, and write, for example, ni instead of ni(u).

The total variation of a signal u ∈ R^N is defined by:

\[ TV(\mathbf{u}) \stackrel{\text{def}}{=} \sum_{n=1}^{N-1} |u_{n+1} - u_n|, \]

and ‖·‖ stands for the usual Euclidean (ℓ²) norm:

\[ \|\mathbf{u}\|^2 \stackrel{\text{def}}{=} \sum_{n=1}^{N} u_n^2. \]

²The term “local maximum (minimum)” would be more appropriate, but we omit the word “local” for brevity.


Table 1.2
Table of notation for the segmentation parameters of a signal.

Segmentation parameters of signal u            Notation
Number of regions.                             p(u)
Index of the left endpoint of region i.        n_i(u)
Size of region i.                              m_i(u)
Number of neighbors of region i.               ρ_i(u)
Is region i a max, a min, or neither?          β_i(u)

The following alternative form for TV(u) can be obtained through a simple calculation:

\[ TV(\mathbf{u}) = \sum_{i=1}^{p(\mathbf{u})} \beta_i(\mathbf{u})\,\rho_i(\mathbf{u})\,\mu_i(\mathbf{u}). \tag{1.1} \]
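To make this notation concrete, the following short Python sketch (not from the thesis; the function names are ours) computes the segmentation parameters of Table 1.2 for a given signal and checks that Eq. (1.1) agrees with the direct definition of the total variation.

```python
import numpy as np

def segmentation_parameters(u):
    """Segmentation parameters of a 1-D signal u (Table 1.2):
    left endpoints n_i (0-based here; the text uses 1-based indexes),
    sizes m_i, intensities mu_i, neighbor counts rho_i, and beta_i."""
    u = np.asarray(u, dtype=float)
    N = len(u)
    n = [0] + [k for k in range(1, N) if u[k] != u[k - 1]]  # region starts
    p = len(n)
    m = [(n[i + 1] if i + 1 < p else N) - n[i] for i in range(p)]
    mu = [u[n[i]] for i in range(p)]
    rho = [1 if i in (0, p - 1) else 2 for i in range(p)]
    beta = []
    for i in range(p):
        nbrs = [mu[j] for j in (i - 1, i + 1) if 0 <= j < p]
        if nbrs and all(mu[i] > v for v in nbrs):
            beta.append(1)       # local maximum
        elif nbrs and all(mu[i] < v for v in nbrs):
            beta.append(-1)      # local minimum
        else:
            beta.append(0)
    return n, m, mu, rho, beta

def total_variation(u):
    """Direct definition: TV(u) = sum_n |u_{n+1} - u_n|."""
    return float(np.sum(np.abs(np.diff(np.asarray(u, dtype=float)))))

# Example built from the four regions of Fig. 1.6 at the first hitting time:
# intensities 0.66, 1.5, 2.36, 1.32 with sizes 20, 10, 10, 10.
u = np.repeat([0.66, 1.5, 2.36, 1.32], [20, 10, 10, 10])
n, m, mu, rho, beta = segmentation_parameters(u)
tv_eq_1_1 = sum(b * r * q for b, r, q in zip(beta, rho, mu))  # Eq. (1.1)
assert np.isclose(tv_eq_1_1, total_variation(u))
```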

We are now ready to describe the 1-D SIDE which is analyzed in this chapter.

1.3.3 A SIDE in 1-D

In [13], a probabilistic interpretation was provided for a special case of SIDEs

in 1-D, in the context of binary change detection problems. In the next section, we

generalize these results to restoration problems. Specifically, we are interested in the

evolution of the following system of equations:

\[ \dot{u}_k(t) = \frac{1}{m_i(\mathbf{u}(t))}\Big\{\mathrm{sgn}\big[\mu_{i+1}(\mathbf{u}(t)) - \mu_i(\mathbf{u}(t))\big] - \mathrm{sgn}\big[\mu_i(\mathbf{u}(t)) - \mu_{i-1}(\mathbf{u}(t))\big]\Big\}, \tag{1.2} \]

for k = n_i(u(t)), n_i(u(t)) + 1, ..., n_{i+1}(u(t)) − 1 and i = 1, ..., p(u(t)), with the initial condition:

\[ \mathbf{u}(0) = \mathbf{u}^0, \tag{1.3} \]


where u⁰ is the signal to be processed. Note that when i = 1 and when i = p(u(t)), Eq. (1.2) involves quantities μ_0 and μ_{p(u(t))+1} which have not been defined. We use the following conventions for these quantities:

\[ \mathrm{sgn}\big[\mu_1(\mathbf{u}(t)) - \mu_0(\mathbf{u}(t))\big] = \mathrm{sgn}\big[\mu_{p+1}(\mathbf{u}(t)) - \mu_p(\mathbf{u}(t))\big] \stackrel{\text{def}}{=} 0. \tag{1.4} \]

Eq. (1.2) says that the intensities of samples within a region have the same dynamics,

and therefore remain equal to each other. A region cannot therefore be broken into

two or more regions during this evolution. The opposite, however, will happen. As

soon as the intensity of some region becomes equal to that of its neighbor (µj(u(τ)) =

µj+1(u(τ)) for some j and for some time instant τ), the two become a single region–

this follows from our definition of a region. This merging of two regions into one will

express itself in a change of the segmentation parameters of u(t) in Eq. (1.2):

\[ m_j(\mathbf{u}(\tau)) = m_j(\mathbf{u}(\tau^-)) + m_{j+1}(\mathbf{u}(\tau^-)), \tag{1.5} \]
\[ p(\mathbf{u}(\tau)) = p(\mathbf{u}(\tau^-)) - 1, \tag{1.6} \]
\[ n_i(\mathbf{u}(\tau)) = n_{i+1}(\mathbf{u}(\tau^-)) \quad \text{for } i = j+1, \ldots, p(\mathbf{u}(\tau)), \tag{1.7} \]
\[ m_i(\mathbf{u}(\tau)) = m_{i+1}(\mathbf{u}(\tau^-)) \quad \text{for } i = j+1, \ldots, p(\mathbf{u}(\tau)). \tag{1.8} \]

Eq. (1.5) says that the number of samples in the newly formed region is equal to the

total number of samples in the two regions that are being merged. Eq. (1.6) reflects

the reduction in the total number of regions by one. Eqs. (1.7) and (1.8) express the

fact that, since region j + 1 is getting merged onto region j, the region that used to

have index j + 2 will now have index j + 1; the region that used to have index j + 3

will now have index j + 2, etc. Borrowing the terminology from [26], we call such

time instant τ when two regions get merged, a hitting time. Note that between two

consecutive hitting times, the segmentation parameters of u(t) remain constant. We

denote the hitting times by t_1, ..., t_{p(u⁰)−1}, where t_1 is the earliest hitting time and t_{p(u⁰)−1} \stackrel{\text{def}}{=} t_f is the final hitting time:

\[ 0 < t_1 \le t_2 \le \ldots \le t_{p(\mathbf{u}^0)-1}. \]


[Figure 1.6: panels (a) and (b); axes: index n, time (scale) t, u_n(t).]

Fig. 1.6. An illustration of the evolution of SIDE: (a) a mesh plot of the solution u(t) as a function of time t and index n, with hitting times t_1 = 3.2, t_2 = 6.54, t_3 = 9.21, t_4 = t_f = 15.53; (b) plots of the initial signal (top) and the solution at the four hitting times.

Note that two hitting times can be equal to each other, when more than one pair

of regions are merged at the same time. Proposition 1 of the next subsection shows

that the evolution (1.2,1.3) will reach a constant signal in finite time, starting from

any initial condition. All samples will therefore be merged into one region, within

finite time, which means that there will be exactly p(u0)− 1 hitting times–one fewer

than the initial number of regions.

The rest of the chapter analyzes the SIDE (1.2), with initial condition (1.3),

conventions (1.4), and the ensuing merging rules (1.5-1.8). An example of its evo-

lution, for N = 50, is depicted in Fig. 1.6. Fig. 1.6(a) shows the solution at all

times from t = 0 until t = 20. The initial signal u(0) has p(u(0)) = 5 regions,

and therefore there are four hitting times. They are t1 = 3.2, t2 = 6.54, t3 = 9.21,

t4 = tf = 15.53. The initial condition and the solution at the four hitting times are


plotted in Fig. 1.6(b). To illustrate the concept of segmentation parameters, we use

the solution at the first hitting time, u(3.2). Its number of regions is p = 4. The

first region goes from n1 = 1 to n2 − 1 = 20, consists of m1 = n2 − n1 = 20 samples,

has ρ1 = 1 neighbor (namely, the second region), and has intensity µ1 = 0.66. Since

the intensity of its only neighbor is larger, the first region is a minimum: β1 = −1.

The second region goes from n2 = 21 to n3 − 1 = 30, consists of m2 = n3 − n2 = 10

samples, has ρ2 = 2 neighbors (namely, the first and third regions), has intensity

µ2 = 1.5, and is neither a maximum nor a minimum (β2 = 0). The third region goes

from n3 = 31 to n4 − 1 = 40, consists of m3 = n4 − n3 = 10 samples, has ρ3 = 2

neighbors (namely, the second and fourth regions), has intensity µ3 = 2.36, and is

a maximum (β3 = 1). The fourth region goes from n4 = 41 to n5 − 1 = N = 50,

consists of m4 = n5 − n4 = 10 samples, has ρ4 = 1 neighbor (namely, the third

region), has intensity µ4 = 1.32, and is a minimum (β4 = −1).

We conclude this subsection by deriving an alternative form for Eq. (1.2) which

will be utilized many times in the remainder of the chapter. If region i is a maximum

of u(t), Eq. (1.2) and conventions (1.4) say that each of its neighbors contributes

−1/mi to the rate of change of its intensity µi(u(t)). Similarly, if region i is a

minimum, each neighbor contributes 1/mi. If region i is not an extremum, then

it necessarily has two neighbors, one of which contributes 1/mi and the other one

−1/mi. In this latter case, the rate of change of µi(u(t)) is zero. Combining these

considerations, and using our notation from the previous subsection, we obtain an

alternative form for Eq. (1.2):

\[ \dot{\mu}_i = -\frac{\beta_i \rho_i}{m_i}, \qquad i = 1, \ldots, p, \tag{1.9} \]

where, to simplify notation, we did not explicitly indicate the dependence of the

segmentation parameters on u(t).
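As a quick illustration of Eq. (1.9), the following minimal sketch (again our own helper, building on the hypothetical segmentation_parameters function from the earlier snippet) evaluates the instantaneous rate of change of each region intensity.

```python
def region_velocities(beta, rho, m):
    """Eq. (1.9): d(mu_i)/dt = -beta_i * rho_i / m_i.
    Extrema move toward their neighbors; non-extremal regions are frozen."""
    return [-b * r / size for b, r, size in zip(beta, rho, m)]

# For the Fig. 1.6 example at t = 3.2 (beta = [-1, 0, 1, -1],
# rho = [1, 2, 2, 1], m = [20, 10, 10, 10]):
# region_velocities(beta, rho, m) == [0.05, 0.0, -0.2, 0.1]
# i.e., the minima rise, the maximum falls, the middle region does not move.
```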

1.4 Basic Properties of the 1-D SIDE

In this section, we study the system (1.2), and prove a number of its properties

which both allow us to gain significant insight into its behavior, and are critical for


developing optimal estimation algorithms presented in the next section. The most

basic properties–illustrated in Fig. 1.6 and proved in [26]–assert that the system has

a unique solution which is continuous, and which becomes a constant signal in finite

time.

Throughout this section u(t) stands for the solution of (1.2,1.3), with initial

condition u0. All the segmentation parameters encountered in this section are those

of u(t). The final hitting time is denoted by tf .

Proposition 1 The solution of the SIDE (1.2,1.3) exists, is unique, is a continuous

function of the time parameter t for all values of t, and is a differentiable function

of t for all t except possibly the hitting times. Every pair of neighboring regions is

merged in finite time. After the final hitting time tf , the solution is a constant:

\[ u_1(t) = u_2(t) = \ldots = u_N(t) = \frac{1}{N}\sum_{n=1}^{N} u^0_n \qquad \text{for } t \ge t_f. \]

Proof. See [26].

Our plan is to use the system (1.2) to solve a Gaussian estimation problem with

a TV constraint, as well as related problems originally posed in [1,15,16]. As we will

see in the next section, this necessitates understanding the time behavior of TV (u(t))

and −‖u(t)− u0‖2. Propositions 2 and 3 below show that both these quantities are

Lyapunov functionals of our system (i.e., decrease as functions of time). In the same

propositions, we also derive formulas for computing the time derivatives of these

quantities.

Proposition 2 For t ∈ [0, tf ], the total variation TV (u(t)) is a monotonically de-

creasing function of time, which changes continuously from TV (u(0)) = TV (u0) to

TV (u(tf )) = 0 (see Fig. 1.7). It is a differentiable function of time except at the

hitting times, and its rate of change is:

\[ \dot{TV}(\mathbf{u}(t)) = -\sum_{i=1}^{p} \frac{\beta_i^2 \rho_i^2}{m_i}. \tag{1.10} \]


[Figure 1.7: left panel, TV(u(t)) versus t; right panel, d/dt[TV(u(t))] versus t.]

Fig. 1.7. Left: The total variation of the solution, as a function of time, for the evolution of Fig. 1.6. Right: Its time derivative.

Proof. Using the expression (1.1) for the TV of u(t), differentiating it with respect

to t, and substituting (1.9) for μ̇_i results in Eq. (1.10). This identity is valid for all t ∈ [0, t_f] where the solution u(t) is differentiable and where the segmentation parameters

are not changing–i.e., for all t except the hitting times. Since, by Proposition 1, the

solution u(t) is a continuous function of t, so is TV (u(t)). We therefore conclude that

TV (u(t)) is a monotonically decreasing, continuous function of time for t ∈ [0, tf ] .

Its value at tf is zero, since, by Proposition 1, u(tf ) is a constant signal.

Proposition 3 Let α(t) = ‖u(t)−u0‖2. Then, for t ∈ [0, tf ], α(t) is a monotonically

increasing function of time, which changes continuously from α(0) = 0 to α(tf ). It

is a differentiable function of time except at the hitting times, and its rate of change

is:

\[ \dot{\alpha}(t) = 2t \sum_{i=1}^{p} \frac{\beta_i^2 \rho_i^2}{m_i}. \tag{1.11} \]

Proof. The following identity is a direct corollary of Lemma 2, proved in the Appendix:

\[ \sum_{k=n_i}^{n_{i+1}-1} u_k(t) = \sum_{k=n_i}^{n_{i+1}-1} u^0_k + t(-\beta_i \rho_i). \tag{1.12} \]


We are now ready to show that Eq. (1.11) holds:

\[ \frac{d}{dt}\|\mathbf{u}(t)-\mathbf{u}^0\|^2 = \frac{d}{dt}\sum_{n=1}^{N}\big(u_n(t)-u^0_n\big)^2 = 2\sum_{n=1}^{N}\big(u_n(t)-u^0_n\big)\,\dot{u}_n(t) \]
\[ \stackrel{\text{Eq. (1.2)}}{=} 2\sum_{i=1}^{p}\sum_{k=n_i}^{n_{i+1}-1}\big(u_k(t)-u^0_k\big)\left(-\frac{\beta_i\rho_i}{m_i}\right) \stackrel{\text{Eq. (1.12)}}{=} 2\sum_{i=1}^{p} t(-\beta_i\rho_i)\left(-\frac{\beta_i\rho_i}{m_i}\right) = 2t\sum_{i=1}^{p}\frac{\beta_i^2\rho_i^2}{m_i}. \]

This identity is valid for all t ∈ [0, tf ] where the solution u(t) is differentiable and

where the segmentation parameters are not changing–i.e., for all t except the hitting

times. Since, by Proposition 1, the solution u(t) is a continuous function of t, so

is α(t). We therefore conclude that α(t) is a monotonically increasing, continuous

function of time for t ∈ [0, tf ] .

Corollary 1 Let α(t) = ‖u(t) − u⁰‖², and let t_l, t_{l+1} be two consecutive hitting times. Then

\[ \alpha(t) = \alpha(t_l) + (t^2 - t_l^2)\sum_{i=1}^{p}\frac{\beta_i^2\rho_i^2}{m_i}, \qquad \text{for any } t \in [t_l, t_{l+1}). \]

Proof. This formula is simply the result of integrating Eq. (1.11) from tl to t, and

using the fact that segmentation parameters remain constant between two hitting

times.

Proposition 4 characterizes the behavior of the functional ‖u(t) − x‖2, where x

is an arbitrary fixed signal satisfying TV (x) ≤ ν. This result is critical in demon-

strating the optimality of our algorithms of the next section.

Proposition 4 Let u(t) be the solution of (1.2,1.3), with TV(u(0)) > ν and TV(u(t_ν)) = ν, for some positive constants ν, t_ν. Suppose that x ∈ R^N is an arbitrary signal with TV(x) ≤ ν. Then, for all t ∈ [0, t_ν] except possibly the hitting times, we have:

\[ \frac{1}{2}\frac{d}{dt}\|\mathbf{u}(t)-\mathbf{x}\|^2 \le \nu - TV(\mathbf{u}(t)), \tag{1.13} \]
\[ \frac{1}{2}\frac{d}{dt}\|\mathbf{u}(t)-\mathbf{u}(t_\nu)\|^2 = \nu - TV(\mathbf{u}(t)). \tag{1.14} \]

Proof is in Appendix A.


1.5 Optimal Estimation in 1-D

In this section, we present 1-D estimation problems that can be efficiently solved

using our evolution equation.

1.5.1 ML Estimation with a TV Constraint

Our first example is constrained maximum likelihood (ML) estimation [28] in

additive white Gaussian noise. Specifically, suppose that the observation u0 is an N -

dimensional vector of independent Gaussian random variables of variance σ2, whose

mean vector x is unknown. The only available information about x is that its total

variation is not larger than some known threshold ν. Given the data u0, the objective

is to produce an estimate x̂ of x.

The ML estimate maximizes the likelihood of the observation [28],

\[ p(\mathbf{u}^0\,|\,\mathbf{x}) = \big(\sqrt{2\pi}\,\sigma\big)^{-N} e^{-\frac{1}{2\sigma^2}\|\mathbf{u}^0-\mathbf{x}\|^2}. \]

Simplifying the likelihood and taking into account the constraint TV(x) ≤ ν, we obtain the following problem:

\[ \text{Find } \hat{\mathbf{x}} = \arg\min_{\mathbf{x}:\,TV(\mathbf{x})\le\nu} \|\mathbf{u}^0 - \mathbf{x}\|^2. \tag{1.15} \]

In other words, we seek the point of the constraint set {x : TV (x) ≤ ν} which is

the closest to the data u0. We now show that a fast way of solving this optimization

problem is to use Eq. (1.2,1.3).

Proposition 5 If TV(u⁰) ≤ ν, then the solution to Eq. (1.15) is x̂ = u⁰. Otherwise, a recipe for obtaining x̂ is to evolve the system (1.2,1.3) forward in time until the time instant t_ν when the solution u(t_ν) of the system satisfies TV(u(t_ν)) = ν. Then x̂ = u(t_ν). The ML estimate is unique, and can be found in O(N log N) time and with O(N) memory requirements, where N is the size of the data vector.

Proof of Proposition 5, part 1: optimality. The first sentence of the proposi-

tion is trivial: if the data satisfies the constraint, then the data itself is the maximum


likelihood estimate. Proposition 2 of the previous section shows that, if u(t) is the

solution of the system (1.2,1.3), then TV (u(t)) is a monotonically decreasing func-

tion of time, which changes continuously from TV (u0) to 0 in finite time. Therefore,

if TV (u0) > ν, then there exists a unique time instant tν when the total variation

of the solution is equal to ν. We now show that u(tν) is indeed the ML estimate

sought in Eq. (1.15), and that this estimate is unique.

Let us denote φ(x1,x2) = ‖x1 − x2‖2. To show that u(tν) is the unique ML

estimate, we need to prove that

\[ \phi(\mathbf{u}^0,\mathbf{u}(t_\nu)) < \phi(\mathbf{u}^0,\mathbf{x}) \quad \text{for any } \mathbf{x} \ne \mathbf{u}(t_\nu) \text{ such that } TV(\mathbf{x}) \le \nu. \tag{1.16} \]

To compare φ(u⁰, u(t_ν)) with φ(u⁰, x), note that

\[ \phi(\mathbf{u}^0,\mathbf{u}(t_\nu)) = \phi(\mathbf{u}(t_\nu),\mathbf{u}(t_\nu)) - \int_0^{t_\nu} \frac{d}{dt}\{\phi(\mathbf{u}(t),\mathbf{u}(t_\nu))\}\,dt, \]
\[ \phi(\mathbf{u}^0,\mathbf{x}) = \phi(\mathbf{u}(t_\nu),\mathbf{x}) - \int_0^{t_\nu} \frac{d}{dt}\{\phi(\mathbf{u}(t),\mathbf{x})\}\,dt. \]

Since φ(u(t_ν), u(t_ν)) = 0 and φ(u(t_ν), x) > 0 for x ≠ u(t_ν), our task would be

accomplished if we could show that

\[ -\int_0^{t_\nu} \frac{d}{dt}\{\phi(\mathbf{u}(t),\mathbf{u}(t_\nu))\}\,dt \;\le\; -\int_0^{t_\nu} \frac{d}{dt}\{\phi(\mathbf{u}(t),\mathbf{x})\}\,dt. \tag{1.17} \]

Note that, by Proposition 1, (d/dt)φ(u(t), x) is well defined for all t except possibly on

the finite set of the hitting times where the left derivative may not be equal to the

right derivative. Both integrals in (1.17) are therefore well defined. Moreover, (1.17)

would follow if we could prove that

\[ \frac{d}{dt}\phi(\mathbf{u}(t),\mathbf{u}(t_\nu)) \;\ge\; \frac{d}{dt}\phi(\mathbf{u}(t),\mathbf{x}) \qquad \text{for almost all } t \in [0, t_\nu]. \tag{1.18} \]

But this is exactly what Proposition 4 of Section 1.4 states.

Finding the constrained ML estimate thus amounts to solving our system of

ordinary differential equations (1.2,1.3) at a particular time instant tν . We now

develop a fast algorithm for doing this. Roughly speaking, the low computational

cost is the consequence of Propositions 2 and 3 and Corollary 1 (which provide


formulas for the efficient calculation and update of the quantities of interest), and

the fast sorting of the hitting times through the use of a binary heap [29].

Proof of Proposition 5, part 2: computational complexity. Eqs. (1.9,1.10)

show that, between the hitting times, every intensity value µi(u) changes at a con-

stant rate, and so does TV (u(t))–as illustrated in Figs. 1.6 and 1.7. It would thus be

straightforward to compute the solution once we know what the hitting times are,

and which regions are merged at each hitting time.

Since a hitting time is, by definition, an instant when the intensities of two

neighboring regions become equal, the hitting times are determined by the absolute

values of the first differences, v_i(t) = |μ_{i+1}(u(t)) − μ_i(u(t))|, for i = 1, ..., p(u(t)) − 1. Let r_i(t) = v̇_i(t) be the rate of change of v_i(t). It follows from Eq. (1.9) that, for a fixed i, the rate r_i(t) is constant between two successive hitting times:

\[ v_i(t + \Delta t) = v_i(t) + \Delta t\, r_i(t). \]

Suppose that, after some time instant t = τ , the rate ri(t) never changes: ri(t) =

ri(τ) for t ≥ τ . If this were the case, we would then be able to infer from the above

formula that vi(t) would become zero at the time instant τ + Ei(τ), where Ei(τ) is

defined by:

\[ E_i(\tau) = \begin{cases} -v_i(\tau)/r_i(\tau), & \text{if } r_i(\tau) < 0, \\ \infty, & \text{otherwise.} \end{cases} \tag{1.19} \]

But as soon as one of the v_i's becomes equal to zero, the two corresponding regions are merged. The first hitting time is therefore t_1 = min_i E_i(0). Similarly, the second hitting time is t_2 = t_1 + min_i E_i(t_1), and, in general, the (l + 1)-st hitting time is t_{l+1} = t_l + min_i E_i(t_l).

We now show how to compute vi(t), ri(t), and the hitting times without explicitly

computing u(t).


Let the signal u*(t) be comprised of average values of the initial condition u⁰, taken over the regions of u(t):

\[ u^*_{n_i}(t) = u^*_{n_i+1}(t) = \ldots = u^*_{n_{i+1}-1}(t) = \frac{1}{m_i}\sum_{k=n_i}^{n_{i+1}-1} u^0_k, \qquad \text{for } i = 1, 2, \ldots, p, \tag{1.20} \]

where the segmentation parameters are those of u(t). One of the key reasons for

the low time complexity of the algorithm is that the solution u(t) is never explicitly

computed until t = tν . Keeping track of u∗(t) is enough, because of a simple rela-

tionship between the two signals. We first derive this relationship, and then show

that keeping track of u∗(t) indeed leads to a computationally efficient algorithm.

An immediate corollary of the definition (1.20) is that u∗(t) has the same number

of regions as u(t): p(u∗(t)) = p, ρi(u∗(t)) = ρi. It is also immediate that edges occur

in the same places in these two signals: ni(u∗(t)) = ni for i = 1, . . . , p. It follows

from Eq. (1.12) that the intensity values within the regions are related:

\[ \mu_i(\mathbf{u}(t)) = \mu_i(\mathbf{u}^*(t)) - \frac{\beta_i\rho_i}{m_i}\, t. \tag{1.21} \]

Finally, note that Eq. (1.21) implies that the minima and maxima of u∗(t) occur

in the same places as the minima and maxima of u(t). To see this, suppose that

µi(u(t)) < µi+1(u(t)). Then βi ≤ 0 and βi+1 ≥ 0, and therefore

\[ \mu_{i+1}(\mathbf{u}^*(t)) - \mu_i(\mathbf{u}^*(t)) = \big(\mu_{i+1}(\mathbf{u}(t)) - \mu_i(\mathbf{u}(t))\big) + \left(\frac{\beta_{i+1}\rho_{i+1}}{m_{i+1}} - \frac{\beta_i\rho_i}{m_i}\right) t > 0. \]

It is analogously shown that if μ_i(u(t)) > μ_{i+1}(u(t)), then μ_i(u*(t)) > μ_{i+1}(u*(t)).

Therefore, the i-th region of u(t) is a maximum (minimum) if and only if the i-th

region of u∗(t) is a maximum (minimum), which means that βi(u∗(t)) = βi. We have

thus shown that the signals u∗(t) and u(t) have identical segmentation parameters,

and that the intensities of these signals are related through Eq. (1.21).

Putting these computations together, we have the following algorithm.

1. Initialize. If TV(u(0)) ≤ ν, output u(0) and stop. Else, assign l = 0, t_l = 0, TV(u(t_l)) = TV(u(0)), u*(t_l) = u(0); initialize the segmentation parameters; use Eq. (1.10) to initialize the time derivative of TV(u(t)).

2. Find possible hitting times. For each i = 1, ..., p − 1, find v_i(t_l) = |μ_{i+1}(u(t_l)) − μ_i(u(t_l))| from u*(t_l), using Eq. (1.21). Use Eq. (1.9) to calculate r_i(t_l) = v̇_i(t_l) from the segmentation parameters of u*(t_l), and use Eq. (1.19) to find all the candidates for the next hitting time: t_l + E_1(t_l), ..., t_l + E_{p−1}(t_l).

3. Sort. Store these candidates on a binary heap [29].

4. Find the next hitting time. Find j = arg min_i [t_l + E_i(t_l)]; the next hitting time is t_{l+1} = t_l + E_j(t_l).

5. Find the new TV. Calculate the total variation at t = t_{l+1}:
TV(u(t_{l+1})) = TV(u(t_l)) + (t_{l+1} − t_l) · dTV(u(t_l))/dt.

6. Merge. If TV(u(t_{l+1})) ≥ ν, then merge regions j and j + 1, as follows:
   (i) remove t_l + E_j(t_l) from the heap;
   (ii) update u*:
        μ_j(u*(t_{l+1})) = [m_j(u*(t_l)) μ_j(u*(t_l)) + m_{j+1}(u*(t_l)) μ_{j+1}(u*(t_l))] / [m_j(u*(t_l)) + m_{j+1}(u*(t_l))];
   (iii) update the time derivative of TV, using Eq. (1.10);
   (iv) update the segmentation parameters via Eqs. (1.5-1.8);
   (v) increment l by 1;
   (vi) update the heap;
   (vii) go to Step 4.

7. Output. Calculate t_ν and u(t_ν), output u(t_ν), and stop.
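To make these steps concrete, here is a minimal Python sketch of the same computation. It is our own illustration, not the thesis's implementation: for clarity it replaces the binary heap of Steps 3-4 by a linear scan over the candidate merge times (so it runs in O(N²) rather than O(N log N)), but between hitting times it uses the same exact piecewise-linear evolution of the region intensities (Eq. (1.9)) and of the total variation (Eq. (1.10)), stopping at the instant when the total variation reaches ν.

```python
import numpy as np

def tv_constrained_ml(u0, nu):
    """TV-constrained ML estimate of Section 1.5.1 (naive O(N^2) sketch)."""
    u0 = np.asarray(u0, dtype=float)
    # Initial regions: runs of equal samples (intensity mu_i, size m_i).
    starts = [0] + [k for k in range(1, len(u0)) if u0[k] != u0[k - 1]] + [len(u0)]
    mu = [u0[s] for s in starts[:-1]]
    m = [starts[i + 1] - starts[i] for i in range(len(mu))]

    def tv(vals):
        return float(np.sum(np.abs(np.diff(vals))))

    while len(mu) > 1 and tv(mu) > nu:
        p = len(mu)
        rate, tv_rate = [], 0.0
        for i in range(p):                       # beta_i, rho_i and Eq. (1.9)
            nbrs = [mu[j] for j in (i - 1, i + 1) if 0 <= j < p]
            rho = len(nbrs)
            if all(mu[i] > v for v in nbrs):
                beta = 1
            elif all(mu[i] < v for v in nbrs):
                beta = -1
            else:
                beta = 0
            rate.append(-beta * rho / m[i])
            tv_rate -= beta ** 2 * rho ** 2 / m[i]      # Eq. (1.10)
        # Time until each pair of neighbors meets (candidate hitting times).
        dts = []
        for i in range(p - 1):
            gap, dr = mu[i + 1] - mu[i], rate[i + 1] - rate[i]
            dts.append(-gap / dr if gap * dr < 0 else np.inf)
        dt_hit = min(dts)
        # Does TV reach nu before the next merge?  TV is linear in between.
        dt_stop = (nu - tv(mu)) / tv_rate
        if dt_stop <= dt_hit:
            mu = [q + r * dt_stop for q, r in zip(mu, rate)]
            break
        # Advance to the hitting time and merge every pair that has met.
        mu = [q + r * dt_hit for q, r in zip(mu, rate)]
        i = 0
        while i < len(mu) - 1:
            if abs(mu[i + 1] - mu[i]) < 1e-9:    # equal up to roundoff
                mu[i] = (m[i] * mu[i] + m[i + 1] * mu[i + 1]) / (m[i] + m[i + 1])
                m[i] += m[i + 1]
                del mu[i + 1], m[i + 1]
                continue
            i += 1
    return np.repeat(mu, m)
```

By construction, whenever TV(u⁰) > ν the returned signal has total variation equal to ν (up to roundoff), and when TV(u⁰) ≤ ν the data are returned unchanged, matching Steps 1 and 7 above.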

The rest of this section explains and analyzes this algorithm.

It is easy to see that both Steps 1 and 2 have O(N) time complexity. Indeed,

both of them involve a constant number of operations per region of the initial signal,

and since the total number of regions p cannot exceed the total number of samples

N , Steps 1 and 2 can take at most O(N) time. Note that Step 2 does not require

the knowledge of u(tl). A binary heap is built in Step 3–the complexity of this is

also O(N) [29].

Next follows a loop consisting of Steps 4, 5, and 6. Step 4 finds the minimum of

Ei’s, which is equivalent to extracting the root of a binary heap and is done in O(1)

time [29]. There are at most N − 1 hitting times; the worst-case scenario is that


the loop (Steps 4, 5, and 6) terminates after the last of these hitting times–i.e., after

N−1 iterations. In this case, the contribution of Step 4 to the total time complexity

is O(N).

The calculation in Step 5 is O(1). Removing one number from a binary heap is O(log N) [29]. The calculation of µ_j(u*(t_{l+1})) in Step 6 is O(1). Obtaining dTV(u(t_{l+1}))/dt from dTV(u(t_l))/dt via Eq. (1.10) is also O(1). The re-assignments (1.5) and (1.6) are O(1), while the re-assignments (1.7) and (1.8) are never explicitly performed, since the underlying data structure is a dynamic linked list.

Note that a merging of two regions does not change the speed of evolution of any

other regions. Therefore, updating the heap in Step 6 amounts to changing and re-

sorting two (at most) entries which involve the newly formed region. As follows from

our discussion of Step 2 above, changing the two entries takes O(1) time. Re-sorting

them on the heap is O(log N) [29]. One execution of Steps 5 and 6 therefore takes

O(log N), and so the contribution of Steps 5 and 6 to the time complexity (i.e., after

N − 1 or fewer iterations of the loop) is O(N log N).

The loop terminates when TV(u(t_{l+1})) < ν ≤ TV(u(t_l)). Therefore, t_ν can be found from:

TV(u(t_ν)) = ν = TV(u(t_l)) + (t_ν − t_l) dTV(u(t_l))/dt   ⟹   t_ν = t_l + (ν − TV(u(t_l))) / (dTV(u(t_l))/dt).

Since u(tν) and u(tl) have the same segmentation parameters, it follows that u∗(tν) =

u∗(tl). Therefore, we can use Eq. (1.21) to calculate u(tν) from u∗(tl) with O(N)

time complexity.

The total time complexity of the algorithm is therefore O(N log N).

Storing the initial N -point signal and the accompanying parameters requires

O(N) memory. As the number of regions decreases, the memory is released accord-

ingly. So the total space required is O(N).
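To make the evolution and its hitting-time bookkeeping concrete, the following Python sketch simulates the flow event by event. It is an illustration rather than the implementation analyzed above: it assumes that the 1-D evolution (1.2, 1.3) moves the value of the i-th region at rate [sgn(µ_{i+1} − µ_i) − sgn(µ_i − µ_{i−1})]/m_i (with the missing term dropped at the two ends of the signal), and it recomputes every quantity at each hitting time, so it runs in O(N²) in the worst case instead of the O(N log N) obtained with the heap and the running means u*(t).

    import numpy as np

    def tv_flow_to_target(u0, nu):
        # Event-driven sketch of the region-merging evolution, stopped when TV(u(t)) = nu.
        u0 = np.asarray(u0, dtype=float)
        vals, lens = [u0[0]], [1.0]            # initial segmentation: runs of equal samples
        for x in u0[1:]:
            if x == vals[-1]:
                lens[-1] += 1.0
            else:
                vals.append(x); lens.append(1.0)
        vals, lens = np.array(vals), np.array(lens)
        cur_tv = np.abs(np.diff(vals)).sum()
        while len(vals) > 1 and cur_tv > nu:
            s = np.sign(np.diff(vals))                                  # sgn(mu_{i+1} - mu_i)
            rate = (np.append(s, 0.0) - np.insert(s, 0, 0.0)) / lens    # assumed region speeds
            v = np.abs(np.diff(vals))                                   # gaps v_i
            r = s * (rate[1:] - rate[:-1])                              # their rates of change r_i
            E = np.full(len(v), np.inf)                                 # hitting-time candidates, Eq. (1.19)
            E[r < 0] = -v[r < 0] / r[r < 0]
            dt, tv_rate = E.min(), r.sum()                              # tv_rate plays the role of dTV/dt
            if cur_tv + dt * tv_rate <= nu:                             # TV reaches nu before the next merge
                vals = vals + ((nu - cur_tv) / tv_rate) * rate
                break
            vals, cur_tv = vals + dt * rate, cur_tv + dt * tv_rate      # advance to the hitting time
            keep_v, keep_m = [vals[0]], [lens[0]]                       # merge regions whose values met
            for x, m in zip(vals[1:], lens[1:]):
                if abs(x - keep_v[-1]) < 1e-12:
                    keep_m[-1] += m
                else:
                    keep_v.append(x); keep_m.append(m)
            vals, lens = np.array(keep_v), np.array(keep_m)
        return np.repeat(vals, lens.astype(int))

If the assumed speeds indeed match the evolution (1.2, 1.3), the returned signal is u(t_ν), i.e., the TV-constrained estimate discussed above.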


1.5.2 The Bouman-Sauer Problem

The problem treated in the previous subsection is closely related to the tomo-

graphic reconstruction strategy considered by Bouman and Sauer in [15, 16]. They

proposed minimizing a functional which is the sum of a quadratic penalty on the

difference between the data and the estimate, and a regularization term which is a

discretization of the total variation. Specializing their formulation to a 1-D restora-

tion problem yields the following:

Find x_BS = arg min_x E_BS(x),   (1.22)

where E_BS(x) = ‖u0 − x‖² + λ TV(x),   (1.23)

and where u0 is the data and λ > 0 is a known parameter which controls the amount

of regularization. This problem has a simple probabilistic interpretation. Just like

in the previous subsection, we are estimating the mean vector x of a condition-

ally Gaussian random vector u0; however, now x is modeled as random, with prior

distribution

K e^{−λ TV(x)/(2σ²)},

where K is a normalizing constant. The optimum xBS is then the MAP estimate [28]

of x based on observing u0.

We now show that, in order to solve this problem, we can still evolve our Eq.

(1.2,1.3) using our algorithm of the previous subsection, but with a different stopping

rule.

Proposition 6 The solution xBS to Problem (1.22) is unique and can be calculated

from the solution u(t) to the system (1.2,1.3). If λ/2 < tf where tf is the final

hitting time, then xBS = u(λ/2); otherwise xBS = u(tf ). The time complexity of

this calculation is O(N log N), and the space complexity is O(N), where N is the

size of the data.


Proof. First note that EBS(u0) = λTV (u0) whereas if TV (x) > TV (u0) then

EBS(x) > λTV (u0). Therefore, the total variation of the solution to the Bouman-

Sauer problem (1.22) cannot exceed the total variation of u0.

Let ν be a fixed real number such that 0 ≤ ν ≤ TV (u0). As shown in Proposi-

tion 2, there is a unique time instant tν during the evolution of Eq. (1.2,1.3) when

TV (u(tν)) = ν. Moreover, it is a direct consequence of Proposition 5 that u(tν)

minimizes ‖u0−x‖2 over the set Xν of all signals x for which TV (x) = ν. Since, for

every signal x in the set Xν , EBS(x) = ‖u0−x‖2+λν, it follows that u(tν) minimizes

EBS(x) over the set Xν . In order to find the global minimum of EBS(x) we therefore

need to minimize EBS(u(tν)) over ν, or equivalently, find the time instant t which

minimizes EBS(u(t)). Combining Propositions 2 and 3, we obtain:

(d/dt) E_BS(u(t)) = (2t − λ) Σ_{i=1}^{p} β_i²ρ_i²/m_i.

For t < λ/2, this time derivative is negative; for t > λ/2, it is positive. Therefore,

if λ/2 < tf , the unique minimum is achieved when t = λ/2; if λ/2 ≥ tf then the

unique minimum is achieved when t = tf . Thus, xBS = u(min(λ/2, tf )) which can

be computed using our algorithm of the previous subsection whose time and space

complexity are O(N log N) and O(N), respectively.

We note that our algorithm of the previous subsection is related to Bouman and

Sauer’s optimization method [15,16] which they call segmentation based optimization.

In their method, a merge-and-split strategy is utilized where neighboring pixels with

equal intensities are temporarily merged.

1.5.3 The Rudin-Osher-Fatemi Problem

In [1], Rudin, Osher, and Fatemi proposed to enhance images by minimizing

the total variation subject to an L2-norm constraint on the difference between the


data and the estimate. In this section, we analyze the 1-D discrete3 version of this

problem:

Find x_ROF = arg min_{x: ‖u0−x‖² ≤ σ²} TV(x),   (1.24)

where u0 is the signal to be processed and σ is a known parameter. We now show

that, in order to solve this problem, we can still evolve our Eq. (1.2,1.3) using our

algorithm of Subsection 1.5.1, but with a different stopping rule.

Proposition 7 If σ2 ≥ ‖u(tf )− u0‖2, then a solution to Eq. (1.24) which achieves

zero total variation is xROF = u(tf ). Otherwise, the solution to Eq. (1.24) is unique,

and is obtained by evolving the system (1.2,1.3) forward in time, with the following

stopping rule: stop at the time instant when

‖u(t)− u0‖2 = σ2. (1.25)

This solution can be found in O(N log N) time, and with O(N) memory requirements,

where N is the size of the data vector.

Proof. According to Proposition 3, ‖u(t) − u0‖2 is a continuous, monotonically

increasing function of time, and therefore a time instant for which (1.25) happens

is guaranteed to exist and be unique, as long as 0 ≤ σ2 < ‖u(tf ) − u0‖2. Let ν be

the total variation of the solution at that time instant, and denote the time instant

itself by tν . Proposition 5 of Subsection 1.5.1 says that u(tν) is the unique solution

of the following problem:

min ‖u0 − x‖2, subject to TV (x) ≤ ν.

In other words, if x ≠ u(t_ν) is any signal with TV(x) ≤ ν, then ‖u0 − x‖² > ‖u0 − u(t_ν)‖² = σ². This means that if x ≠ u(t_ν) is any signal with ‖u0 − x‖² ≤ σ², then we must have TV(x) > ν. Therefore, u(t_ν) is the unique solution of (1.24).

We note moreover that the new stopping rule does not change the overall com-

putational complexity of our algorithm of Subsection 1.5.1. Every time two regions

³This means that our signals are objects in R^N, rather than L².


are merged, two terms in the sum Σ_i β_iρ_i/m_i disappear and one new term appears.

All other terms stay the same, and therefore updating the sum after a merge takes

O(1) time. Once this sum is updated, computing the new ‖u(t)− u0‖2 is also O(1),

thanks to the Corollary of Proposition 3.

1.5.4 Adaptive Stopping

We now outline a strategy for a simple heuristic modification of our algorithm

for the case when the parameters ν, λ, σ of Subsections 1.5.1, 1.5.2, and 1.5.3,

respectively, are unavailable and cannot be easily estimated.

In the absence of these parameter values, a different stopping rule for our evo-

lution equation can be designed, based on the following qualitative considerations.

The underlying algorithm is a region-merging procedure which starts with a fine

segmentation and then recursively removes regions. The regions that are removed

at the beginning of the process typically correspond to noise. However, as shown

above, the process will eventually aggregate all signal samples into one region, and

therefore it will at some point start removing “useful” regions (i.e., the regions that

are due to the underlying signal that is to be recovered). A good stopping rule would

stop this evolution when most of the “noise” regions have been removed but most of

the “useful” regions are still intact. A good heuristic for distinguishing the two types

of region is the energy: we would expect the “noise” regions to have a small energy

and the “useful” regions to have a high energy. Specifically, we track the quantity

‖u∗(tl)− u∗(tl−1)‖2. At each hitting time, this quantity is evaluated and compared

to its average over the previous hitting times. When this quantity abruptly changes,

the evolution is stopped.
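As a small illustration, this rule amounts to a single scan over the per-hitting-time energies; the jump factor below is a hypothetical choice, since no specific threshold is prescribed in the text.

    def adaptive_stop(energies, jump_factor=5.0):
        # energies[l-1] = ||u*(t_l) - u*(t_{l-1})||^2 recorded at the l-th hitting time
        running_sum = 0.0
        for l, e in enumerate(energies, start=1):
            if l > 1 and e > jump_factor * (running_sum / (l - 1)):
                return l                     # abrupt change: stop the evolution here
            running_sum += e
        return len(energies)                 # no abrupt change detected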

1.6 Conclusions

In this chapter, we presented a relationship between nonlinear diffusion filtering

and optimal estimation. We showed that this relationship is precise in 1-D and


results in fast restoration algorithms both in 1-D and 2-D. In particular, we developed

O(N log N) algorithms to exactly solve the following three problems in 1-D.

• The problem of finding the TV-constrained maximum likelihood estimate of

an unknown N -point signal corrupted by additive white Gaussian noise.

• The Bouman-Sauer problem [15,16] of minimizing a TV-regularized functional.

• The Rudin-Osher-Fatemi problem [1] of constrained TV minimization.

In each case, our algorithm requires one parameter–the stopping criterion–which is

related to the noise variance. When this parameter is unavailable, we can use an

adaptive version of our procedure.

The generalization of our algorithm to 2-D is achieved by defining a region as

any connected set of pixels, and defining two regions to be neighbors if they share

a boundary, as described in [26]. The 2-D algorithm was illustrated above and

shown to produce comparable results to the existing methods, with considerable

computational savings.


2. BEST BASIS SEARCH IN LAPPED DICTIONARIES

2.1 Introduction

Adaptive signal representation and approximation in overcomplete dictionaries

have received much attention in recent years. The contributions of this chapter are

in the area of best basis search algorithms where the aim is to adaptively select,

from a dictionary of orthonormal bases, the basis which minimizes a cost for a

given signal [2, 3, 30]. Such methods have been demonstrated to be effective for

compression [31–34], estimation [35–43], and time-frequency (or space-frequency)

analysis [4, 44–48].

The original work on best basis search [2, 3] exploited the fact that a dictionary

consisting of local cosine bases [49–52] on dyadic intervals can be represented as

a single dyadic tree. This made it possible to find the best basis, for an additive

cost function, via an efficient tree pruning algorithm. On the other hand, it has

been noticed that, for an additive cost function, the optimal segmentation of a 1-D

signal can be efficiently found using dynamic programming. This has been exploited

in many contexts such as piecewise polynomial approximation [53–55], best basis

search in time-varying wavelet packet [48] and MDCT [56] dictionaries, estimation

of abrupt changes in a linear predictive model [57], and optimal selection of cosine-

modulated filter banks [58]. In this chapter, we exploit a similar idea to remove

the restriction of [2, 3] that the supports of local cosine basis functions be dyadic,

and use a dynamic programming algorithm to find the best basis in a much larger

collection of local cosine bases. As we show through several examples, this results in

sparser representations, more accurate time-frequency descriptions, and approximate

shift-invariance. We show that this algorithm can moreover be made strictly shift-

invariant by using a procedure similar to the one developed in [4]. Using a noise


removal example, we show that best basis thresholding in the proposed dictionary

results in a higher SNR and a lower RMS error than the best dyadic local cosine

decomposition. We furthermore propose two accelerated versions of the algorithm

which explore various trade-offs between computational efficiency and adaptability,

and which are based on the idea of two-stage processing of the data: first, small

pieces of a signal are processed using dynamic programming within each piece, and

then the results are combined using another dynamic programming sweep.

The use of our algorithms is not restricted to local cosine dictionaries. For ex-

ample, lapped bases in the frequency domain were used in [38,59,60]. We propose a

novel construction which represents the discrete cosine transform (DCT) of a signal

in a local cosine dictionary, and therefore corresponds to representing the original

signal in a dictionary whose elements are the inverse DCT’s of the local cosine func-

tions. We give an example where noise removal using this new dictionary results in

a higher SNR and a lower RMS error than the best local cosine representation.

While we develop and illustrate our algorithms using two dictionaries—the local

cosines in the time domain and in the DCT domain—we show in Section 2.4 that

our algorithms are applicable to any finite dictionary comprised of lapped orthogonal

bases.

Note that the paper [2] proposed using an entropy cost. Since then, a number

of papers have proposed different criteria, e.g., based on the MDL principle [35,

39], best basis thresholding [37, 38, 41], Bayesian estimation [36, 37], rate-distortion

framework [31,44,45,48]. We do not address the issue of cost selection in this chapter,

and use two standard cost functions—entropy [2] and thresholding cost [37, 38]—to

illustrate our framework. Our algorithms, however, can be used in conjunction with

any additive or multiplicative cost.


2.2 Local Cosine Decompositions

2.2.1 Best Basis Search Problem

The general best basis search problem is formulated, for example, in [2, 38]. We

consider a dictionary D that is a set of orthonormal bases for R^N:

D = {B^λ}_{λ∈Λ},

where each basis B^λ consists of N vectors:

B^λ = {g^λ_m}_{1≤m≤N}.

The cost of representing a signal f in a basis B^λ is typically defined as follows [2, 38]:

C(f, B^λ) = Σ_{m=1}^{N} Φ( |⟨f, g^λ_m⟩|² / ‖f‖² )    or    C(f, B^λ) = Σ_{m=1}^{N} Φ( |⟨f, g^λ_m⟩|² ),   (2.1)

where Φ is application dependent. Any basis which achieves the minimum of the

cost C(f,Bλ) over all the bases in the dictionary, is called the best basis. In this

chapter, we develop fast algorithms for finding the best basis in local cosine dictio-

naries.

2.2.2 A Local Cosine Dictionary

We identify each vector in RN with a signal f(n) defined for n = 0, 1, . . . , N − 1.

A local cosine basis [38,49–52] for RN is defined using cosine functions multiplied by

overlapping smooth windows. For each discrete interval [u, v − 1] ⊂ [1, N − 2], we

define a window function βu,v (see Fig. 2.1(a)) which gradually ramps up from zero

to one around u− 1/2 and goes down from one to zero around v − 1/2:

β_{u,v}(t) =
    r((t − (u − 1/2))/η)      if u − 1/2 − η ≤ t < u − 1/2 + η,
    1                          if u − 1/2 + η ≤ t < v − 1/2 − η,
    r(((v − 1/2) − t)/η)      if v − 1/2 − η ≤ t ≤ v − 1/2 + η,
    0                          otherwise,



(a) A window function βu,v. (b) A local cosine basis function.

Fig. 2.1. A window function βu,v and an element of a local cosine dictionary.

where the parameter η ∈ R controls how fast the window tapers off, and r is a

monotonically increasing profile function,

r²(t) + r²(−t) = 1 for all t ∈ R,     with r(t) = 0 if t < −1 and r(t) = 1 if t > 1.

For u = 0, the window β_{u,v} does not ramp up to one but rather starts off directly at one:

β_{0,v}(t) =
    1                          if t < v − 1/2 − η,
    r(((v − 1/2) − t)/η)      if v − 1/2 − η ≤ t ≤ v − 1/2 + η,
    0                          otherwise,

for 2 ≤ v ≤ N − 1. Similarly, the windows β_{u,N} are equal to one for u − 1/2 + η ≤ t ≤ N − 1, and the window β_{0,N} is equal to one on the whole interval [0, N − 1].
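As a concrete illustration of these windows, the short Python sketch below evaluates β_{u,v} on the integer grid and checks numerically that two adjacent windows with the same η are power complementary where they overlap, i.e., β_{u,v}(t)² + β_{v,w}(t)² = 1 around t = v − 1/2. The profile r(t) = sin(π(1 + t)/4) on [−1, 1] is just one common choice satisfying the conditions above, assumed here for the example.

    import numpy as np

    def profile(t):
        # r(t) = sin(pi*(1+t)/4) on [-1, 1]; r = 0 below -1 and r = 1 above 1
        return np.sin(np.pi * (1.0 + np.clip(t, -1.0, 1.0)) / 4.0)

    def window(u, v, eta, n):
        # beta_{u,v}(n): rises around u - 1/2, falls around v - 1/2 (valid when v - u >= 2*eta)
        n = np.asarray(n, dtype=float)
        return profile((n - (u - 0.5)) / eta) * profile(((v - 0.5) - n) / eta)

    u, v, w, eta = 32, 64, 96, 8
    n = np.arange(128)
    b_left, b_right = window(u, v, eta, n), window(v, w, eta, n)
    overlap = (n >= v - 0.5 - eta) & (n <= v - 0.5 + eta)
    print(np.abs(b_left[overlap] ** 2 + b_right[overlap] ** 2 - 1.0).max())   # ~ 1e-16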

Following [38], we define the discrete local cosine family B_{u,v} as follows:

B_{u,v} = { β_{u,v}(n) √(2/(v − u)) cos[ π (κ + 1/2) (n − (u − 1/2)) / (v − u) ] }_{0 ≤ κ < v−u},

where n ∈ Z is a discrete time parameter and κ ∈ Z is a discrete frequency parameter. One signal from such a family is depicted in Fig. 2.1(b). It can be shown [38] that this set of signals is orthonormal if v − u ≥ 2η.


For a signal f of length N , we search for the best basis in the local cosine

dictionary

D = ⋃_{λ∈Λ} B^λ   (2.2)

which consists of the following local cosine bases:

B^λ = ⋃_{k=0}^{K_λ−1} B_{n_k, n_{k+1}},   (2.3)

where λ is a set of partition points {n_k}_{0≤k≤K_λ} of the domain of f. If the partition points are such that only adjacent windows overlap (i.e., if n_{k+1} − n_k ≥ 2η for all k), then B^λ is an orthonormal basis for R^N [38]. In order to achieve this, we impose that the finest cell size be some fixed integer M ≥ 2η, i.e., we require the partition points to be integer multiples of M:

n_0 = 0 < n_1 < · · · < n_{K_λ−1} < n_{K_λ} = N,   (2.4)
n_k is divisible by M, where M ≥ 2η is a fixed integer.   (2.5)

We will refer to the resulting D as a mod-M dictionary. We note that a mod-M

dictionary is larger than the local cosine tree dictionary of [2]. In fact, if we choose M

such that N/M = 2J where J is the maximum depth of the local cosine tree of [2],

it can be easily shown that the local cosine tree dictionary of [2] will be a subset of

the mod-M dictionary.

2.2.3 A Best Basis Algorithm

We now describe an efficient best basis search algorithm for our mod-M dictio-

nary. It is a dynamic programming algorithm whose slight variants have been widely

used in literature since [53] to find the best segmentation of a 1-D signal. Our ex-

position closely follows [48] where it was used to find the best block wavelet packet

basis.


O_{0,N} ← best_basis(f) {
  for u = N − M, N − 2M, . . . , 2M, M, 0 {
    O_{u,N} ← B_{u,N};                          // initialize O_{u,N}
    C*_{u,N} ← C(f, B_{u,N});                   // initialize C*_{u,N}
    for d = u + M, u + 2M, . . . , N − M {
      if C(f, B_{u,d}) + C*_{d,N} < C*_{u,N} {
        O_{u,N} ← B_{u,d} ∪ O_{d,N};
        C*_{u,N} ← C(f, B_{u,d}) + C*_{d,N};
      }
    }
    save O_{u,N} and C*_{u,N} in an internal data structure;
  }
  return O_{0,N};
}

Fig. 2.2. Pseudocode specification of a fast dynamic programmingalgorithm for the best local cosine basis search. The cost of the bestbasis Ou,N is denoted by C∗u,N .

Let 0 ≤ u < v ≤ N , and let the best basis associated with the window βu,v be

denoted by Ou,v. For v − u > M ,

O_{u,v} = B_{u,d*} ∪ O_{d*,v}   if C(f, B_{u,d*}) + C(f, O_{d*,v}) < C(f, B_{u,v}),   and   O_{u,v} = B_{u,v}   otherwise,   (2.6)

where

d* = arg min_{d: u < d < v, d a multiple of M} [ C(f, B_{u,d}) + C(f, O_{d,v}) ].

(Note that, since the cost function is additive, the cost of B_{u,d} ∪ O_{d,v} is C(f, B_{u,d}) + C(f, O_{d,v}).) The initial condition is that, for v − u = M,

O_{u,v} = B_{u,v}.

Then the best basis O0,N for signal f can be calculated via dynamic programming, by

repeatedly applying (2.6). The C pseudocode for this algorithm is shown in Fig. 2.2.

In Fig. 2.2, we use C∗u,v to denote the cost of the best basis Ou,v, and assume that the

costs C(f,Bu,v) have been precomputed. The algorithm calculates Ou,N and C∗u,N

for u = N −M,N − 2M, . . . , 2M,M, 0. The calculation of each Ou,N involves a loop


over u + M, u + 2M, . . . , N − M, with O(1) computations within each iteration of the loop. Therefore, the dynamic programming has time complexity O(L²), where L = N/M. The major computational burden is associated with computing the costs C(f, B_{u,d}). The calculation of C(f, B_{u,v}) via the definition (2.1) involves O(a) additions, where a = v − u, as well as the computation of the inner product of f with each basis function in B_{u,v}, which requires O(a log a) operations using a fast local cosine transform algorithm [38, 52]. In the process of calculating O_{0,N}, we need the values of C(f, B_{u,v}) with u = pM, v = qM, where p = 0, 1, . . . , L − 1 and q = p + 1, p + 2, . . . , L. It is easy to show that this results in the overall time complexity¹ of O(L²N log N).
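A direct Python transcription of the recursion (2.6) (and of the pseudocode of Fig. 2.2) is sketched below. It assumes that the costs C(f, B_{u,v}) have already been precomputed and stored in a dictionary cost keyed by the pair (u, v), with u and v restricted to multiples of M (or N); the search itself is then a few lines of dynamic programming.

    def best_basis_mod_M(cost, N, M):
        # cost[(u, v)] = C(f, B_{u,v}) for admissible u < v; returns (optimal cost, partition points)
        best_cost, best_cut = {}, {}
        for u in range(N - M, -1, -M):               # u = N-M, N-2M, ..., M, 0
            best_cost[u], best_cut[u] = cost[(u, N)], None
            for d in range(u + M, N, M):             # try every first partition point d
                c = cost[(u, d)] + best_cost[d]
                if c < best_cost[u]:
                    best_cost[u], best_cut[u] = c, d
        cuts, u = [0], 0                             # follow the stored cuts to read off the best basis
        while best_cut[u] is not None:
            u = best_cut[u]
            cuts.append(u)
        return best_cost[0], cuts + [N]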

2.2.4 Shift-Invariance: A Qualitative Discussion

We call a best basis search algorithm n0-shift-invariant if circularly shifting any

signal by an arbitrary integer multiple of n0 leads to shifting its best basis by the

same multiple of n0. When n0 = 1—i.e., when the algorithm is invariant to any

integer shift, we simply call it shift-invariant.

The mod-M method described in Section 2.2.3 is, strictly speaking, not M -shift-

invariant, since we always require the leftmost basis function to start at the leftmost

point of the signal. It is, however, M -shift-invariant, modulo these boundary effects:

i.e., it is invariant to shifts by integer multiples of M for signals whose support is

well within the interval [0, N − 1].

The dyadic best local cosine basis algorithm of [2, 3] is fundamentally not shift-

invariant since it uses a dyadic tree. Its variant introduced in [4] is formally shift-

invariant; however, we now show that the mod-M method offers certain advantages.

To illustrate the shift-invariance properties of the algorithms, we use a 256-point

signal depicted in Fig. 2.3(a) which consists of two local cosine basis functions, one

with u = 32 and v = 64, and another one with u = 128 and v = 160. For each

1The time complexities of all our algorithms are summarized in a table in Appendix C.



Fig. 2.3. The original signals and time-frequency representations ofthe best local cosine basis with smallest cell size M = 16: (a) a signalconsisting of two local cosine basis functions; (b) the time-frequencytiling for the best local cosine basis of [2, 3]; (c) the time-frequencytiling for the shift-invariant local cosine decomposition [4]; (d) thetime-frequency tiling for the best mod-M local cosine basis; (e-h) asimilar experiment for the signal in (a) shifted by 16 samples; (i-l) asimilar experiment for a signal where the two local cosine bumps areshifted by different amounts. The darker the rectangle in (b-d,f-h,j-l)the larger the amplitude of the corresponding local cosine coefficient.


algorithm, the smallest cell size M is chosen to be 16. (For the single-tree methods,

this means that the maximal tree depth is set to J = log2 N − log2 M = 4.) We

follow [2] and use the entropy cost function,

C(f, B^λ) = − Σ_{m=1}^{N} ( |⟨f, g^λ_m⟩|² / ‖f‖² ) ln ( |⟨f, g^λ_m⟩|² / ‖f‖² ).
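For reference, this cost is immediate to evaluate from the expansion coefficients; the small sketch below treats 0 · ln 0 as 0.

    import numpy as np

    def entropy_cost(coeffs, signal_energy):
        # coeffs: inner products <f, g_m> in a candidate basis; signal_energy = ||f||^2
        p = np.abs(np.asarray(coeffs)) ** 2 / signal_energy
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())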

Figs. 2.3(b) and (d) show the time-frequency tilings2 for the best basis extracted

by the method of [2] and the mod-M method, respectively. These are identical.

However, when the signal is shifted by 16 samples to the right (Fig. 2.3(e)), the

result for the mod-M method stays the same (Fig. 2.3(h)) whereas the dyadic best

basis changes drastically (Fig. 2.3(f)).

This can be fixed by the shift-invariant local cosine decomposition (SI-LCD) pro-

posed in [4] which essentially considers N shifted versions of the dictionary, and

is therefore shift-invariant. The best basis extracted by the SI-LCD algorithm

(Fig. 2.3(g)) is identical to the best mod-M basis. Let us now consider another

signal, obtained by taking the signal of Fig. 2.3(a), retaining its first component as

is, and shifting its second component to the right by 16, as shown in Fig. 2.3(i).

The mod-M algorithm is still invariant to this change, as evidenced by Fig. 2.3(l).

SI-LCD, however, produces a different basis.

2.2.5 A Strictly Shift-Invariant Algorithm

The qualitative discussion of Section 2.2.4 shows that the mod-M algorithm pos-

sesses the desired shift-invariant properties, even though it is not, strictly speaking,

shift-invariant. We now show, in addition, that we can make it strictly invariant to

any integer shift, using a method similar to [4].

For a discrete signal f of length N , we extend both the signal and the basis

functions periodically with period N , so that all shifts of all signals will effectively

²To depict a coefficient corresponding to a local cosine basis function with frequency κ and window β_{u,v}, we use a rectangle which extends horizontally from u to v and vertically from κ/(v−u) to (κ+1)/(v−u). The larger the coefficient, the darker the rectangle.



(a) “Grea” speech signal. (d) A 512-point segment.

Time-frequency planes for best local cosine bases:

(b) Dyadic, C = 4.11. (e) Dyadic, C = 2.44.

(c) Mod-M, C = 3.51. (f) Mod-M, C = 2.02.

Fig. 2.4. Two signals and the time-frequency pictures of their bestbases: (a) segment “grea” of the speech signal “greasy”; (b,c) thetime-frequency tilings for the best local cosine basis of [2] and for thebest mod-M local cosine basis, respectively; (d-f) a similar experi-ment for a shorter segment of the speech signal.



(a) “Grea” speech signal. (b) A 512-point segment.

Fig. 2.5. The performance of four algorithms for extracting the bestlocal cosine basis: dyadic [2] (dotted), shift-invariant LCD [4] (dash-dot), the proposed mod-M algorithm (solid), and the proposed shift-invariant version of the mod-M algorithm (dashed). The optimalcost is depicted as a function of the minimal allowed cell size: (a)4096-point “grea” speech signal, (b) 512-point segment of the signal.

be circular shifts. We expand the dictionary of Section 2.2.2 by adding in the shifts

of the basis signals. We define D0 to be the same as the dictionary of Eqs. (2.2,2.3),

and let Ds be D0 shifted by s to the left:

D_s = ⋃_{λ∈Λ} B^λ_s,   where   B^λ_s = ⋃_{k=0}^{K_λ−1} B_{n_k+s, n_{k+1}+s}.

The new dictionary D_SI is defined to be the union of the N shifted sub-dictionaries:

D_SI = ⋃_{s=0}^{N−1} D_s.

Now the best basis search involves finding the subdictionary Ds∗ that contains the

best basis and searching for the best basis in Ds∗ . Using an argument similar to

the one in Section 2.2.3, it can be shown that the optimal solution is achieved

in O(N³ log N) time. In addition, we present in Appendix B a suboptimal solution based on the method in [4], which results in a time complexity similar to that of Section 2.2.3.


2.2.6 Examples with the Entropy Cost

To further illustrate our methods, we use two examples which compare our pro-

posed mod-M local cosine decomposition with the best local cosine basis selection

based on a single dyadic tree [2]. We again use the entropy cost function. We set

η = 8.

Fig. 2.4(a-c) shows a speech signal of length N = 4096 and the time-frequency

pictures for the best bases selected by the two methods. The minimal cell size for

these experiments was set at M = 16 for both methods. The resulting costs are:

4.11 for the dyadic dictionary and 3.51 for ours. In addition, note the sparser time-

frequency representation in Fig. 2.4(c) resulting from our method.

In Fig. 2.4(d), we zoom into the samples 1001 through 1512 of the signal in

Fig. 2.4(a). For this 512-point segment, we compute the best basis with the two

methods, again setting M = 16. This results in the following costs: 2.44 for the

dyadic dictionary and 2.02 for ours. Again, the representation resulting from our

method corresponds to a more sparse time-frequency tiling. Moreover, the transition

between two phonemes (in the neighborhood of the sample 1150) is missed by the

best dyadic basis but is accurately captured by the best mod-M basis.

Fig. 2.5 summarizes a larger experiment where the best basis was found for

four different values of the minimal cell size M : 128, 64, 32, and 16. In addition

to the dyadic single-tree algorithm and the mod-M algorithm, we compared the

results with the shift-invariant versions of the two algorithms: the shift-invariant

local cosine decomposition (SI-LCD) [4] and our shift-invariant mod-M algorithm

described in Appendix B. The resulting costs for the four algorithms are plotted as

a function of M . Note that in both cases, the whole curve for the mod-M algorithm

is below each of the outcomes for the algorithms in [2, 4]. This is to be expected

since we perform the search over a much larger dictionary. The price to pay is the

time complexity of the algorithm, which, as indicated above, is higher than the time

complexity for the dyadic best-basis search algorithm. Note, however, that for small



Fig. 2.6. Top row: two local cosine functions. Bottom row: twofunctions from the frequency-domain local cosine dictionary obtainedby taking the inverse DCT-IV of the functions in the top row.

signal lengths the running time of the two algorithms is similar. For example, in our

experiment with the 512-point signal, the running times3 for M = 128, 64, 32, 16 are

0.01, 0.02, 0.02, and 0.03 seconds, respectively, for the dyadic algorithm and 0.01,

0.01, 0.05, and 0.19, respectively, for the mod-M algorithm. This suggests that

the most practical way of using this algorithm is on blocks whose length is a small

multiple of M . Sections 2.3.2 and 2.3.3 investigate this idea.

2.2.7 Frequency-Domain Local Cosines

Frequency-domain lapped bases have been suggested in, e.g., [38, 59, 60]. For

example, it was shown in [60] that decomposing a signal in a Meyer wavelet basis [59]

is equivalent to decomposing its spectrum in a lapped trigonometric basis. We

propose a new dictionary of lapped bases in the frequency domain which we call

3All our code was written in Matlab and run using Matlab 6.5 under Windows XP on a machine with a Pentium-M

1.4GHz processor.


the mod-M frequency-domain local cosine (FDLC) dictionary. This dictionary is

obtained by taking the inverse discrete cosine transform (DCT) of each basis vector

of the mod-M local cosine dictionary defined in Eqs. (2.2-2.5). (A dyadic FDLC

dictionary can similarly be obtained by taking the inverse DCT of each basis vector

of the dyadic local cosine dictionary.) Two FDLC basis vectors are depicted in the

bottom row of Fig. 2.6; their DCT’s are members of the local cosine dictionary and

are shown in the top row of Fig. 2.6.

To find the best basis of a signal f in this dictionary, we calculate the DCT f̂ of f, and then find the best local cosine basis for f̂ using the mod-M method described above. Since the DCT is an orthogonal transform, |⟨f, g^λ_m⟩|² = |⟨f̂, ĝ^λ_m⟩|² (where ĝ^λ_m, the DCT of the FDLC basis vector g^λ_m, is a local cosine function) and ‖f̂‖² = ‖f‖²; therefore, the costs (2.1) computed in the DCT domain are identical to the costs in the time domain.
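The DCT used throughout is the orthonormal DCT-IV. For a small experiment, the change of domain can be done with its explicit matrix, which is symmetric and orthogonal (and hence its own inverse); a fast transform would be used in practice. This is an illustrative sketch, not part of the method itself.

    import numpy as np

    def dct_iv_matrix(N):
        # rows are the orthonormal DCT-IV vectors: sqrt(2/N) * cos(pi*(k+1/2)*(n+1/2)/N)
        k = np.arange(N)
        return np.sqrt(2.0 / N) * np.cos(np.pi * np.outer(k + 0.5, k + 0.5) / N)

    # f_hat = dct_iv_matrix(len(f)) @ f gives the spectrum on which the mod-M search is run;
    # because the matrix is orthogonal, inner products, norms, and hence costs are unchanged.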

The FDLC dictionary offers alternative ways of tiling the time-frequency plane

and is better suited than the local cosine dictionary to the analysis of some types

of nonstationary signals, for example, those whose energy is mostly concentrated

in several frequency bands. This is illustrated in Fig. 2.7 where the noisy “grea”

speech signal, f , and its DCT are shown in (a) and (b), respectively. The noise-free

signal, x, and its DCT, are shown in Figs. 2.7 (c) and (d), respectively. Fig. 2.7(e)

shows the basis vector gLC from the best mod-M local cosine basis for f whose

inner product with f is the largest, and Fig. 2.7(f) shows the DCT of this basis

vector. Fig. 2.7(g) shows the basis vector gFDLC from the best mod-M FDLC basis

whose inner product with f is the largest, and Fig. 2.7(h) shows the DCT of this

basis vector. It is evident from Figs. 2.7(f) and 2.7(h) that gFDLC is more sharply

focused around the strongest resonant frequency of x than gLC whose spectrum is

more spread out. As we show in Section 2.2.8 and Table 2.1, the noise removal

performance in the problem of recovering x from this observation f is better for the

best FDLC basis than for the best local cosine basis.


(a) Noisy “grea” speech signal. (b) Its DCT-IV. (c) “Grea” speech signal. (d) Its DCT-IV. (e) Best LC dictionary element. (f) Its DCT-IV. (g) Best FDLC dictionary element. (h) Its DCT-IV.

Fig. 2.7. (a) A noisy speech signal; (b) its DCT; (c) noise-free speechsignal; (d) its DCT; (e) the basis vector from the best mod-M localcosine basis whose inner product with the signal in (a) is the largest;(f) its DCT; (g) the basis vector from the best mod-M FDLC basiswhose inner product with the signal in (a) is the largest; (h) its DCT.


2.2.8 Noise Removal Examples

Following [41], we adopt the following procedure for estimating a signal x from

its noisy measurement f : we find the best basis from a dictionary D, decompose f in

the best basis, threshold the coefficients, and reconstruct an estimate of x from the

remaining coefficients. We use hard thresholding, i.e., we keep every coefficient whose

absolute value is above a threshold T , and replace all other coefficients with zeros. As

suggested in [37,38], we use the following cost function for a basis Bλ = {gλm}1≤m≤N :

C(f, B^λ) = Σ_{m=1}^{N} Φ( |⟨f, g^λ_m⟩|² ),   with   Φ(u) = u − σ² if u ≤ T², and Φ(u) = σ² if u > T².

In our experiments, we follow [61] and fix T = 3.8σ. In Table 2.1 and Fig. 2.8,

we present the noise removal results for the dyadic and mod-M versions of both

the time-domain and frequency-domain local cosine dictionaries. To perform noise

removal with a frequency-domain local cosine dictionary, a signal is transformed

using DCT-IV. Both the best basis extraction and thresholding are then done in the

DCT domain. Finally, the resulting DCT coefficients are transformed back using the

inverse DCT-IV.
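The estimation step itself is simple once a basis has been selected. A minimal sketch is given below for a basis supplied explicitly as the rows of an orthonormal matrix B (an assumption made only for clarity; in practice the coefficients come from a fast local cosine transform). For the frequency-domain dictionaries, the same operations are applied to the DCT-IV of the signal and the result is transformed back, as described above.

    import numpy as np

    def denoise_in_basis(f, B, sigma, t_factor=3.8):
        # hard thresholding at T = 3.8*sigma in the orthonormal basis B (rows = basis vectors)
        T = t_factor * sigma
        c = B @ f                       # analysis
        c[np.abs(c) <= T] = 0.0         # keep only the coefficients above the threshold
        return B.T @ c                  # synthesis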

                                               SNR (dB)   RMS error      cost
  Dyadic Local Cosines                          11.23      107.03     3.2e+007
  Mod-M Local Cosines                           11.62      102.38     2.1e+007
  Dyadic Frequency-Domain Local Cosines         12.51       92.36     2.5e+007
  Mod-M Frequency-Domain Local Cosines          12.62       91.29     1.8e+007

Table 2.1. Comparison of the dyadic and mod-M methods with noise level σ = 150 (SNR = 8.22 dB).


For both time-domain and frequency-domain local cosine dictionaries, the mod-

M dictionary leads to a lower cost, higher SNR, and lower RMS error than the dyadic

dictionary. The FDLC dictionaries quite dramatically outperform their respective

time-domain counterparts, achieving higher SNRs, lower RMS errors, lower costs,

and significantly sparser time-frequency tilings shown in the last row of Fig. 2.8.

In addition, the last row of Fig. 2.8 shows that the FDLC dictionaries resolve the

resonant frequencies (i.e., frequencies corresponding to the peaks in the DCT plots

of Figs. 2.8(b) and 2.8(d)) much better than the local cosine dictionaries.

2.3 Further Extensions of the Basic Algorithm

2.3.1 Extension 1, Min-M: Allowing Arbitrary Positions for Windows

The mod-M algorithm restricted the length of the local cosine windows to be

integer multiples of the finest interval size M . We now allow arbitrary window length

with a lower bound M . This results in a larger dictionary than the dictionary of

Section 2.2.2. The dictionary and the local cosine bases in it are defined in the same

way as in Eqs. (2.2,2.3), while the constraints (2.4,2.5) on the partition points are

changed to the following:

n_0 = 0 < n_1 < · · · < n_{K_λ−1} < n_{K_λ} = N,   (2.7)
n_{k+1} − n_k ≥ M for k = 0, 1, . . . , K_λ − 1, where M ≥ 2η is a fixed integer.   (2.8)

This new dictionary will be referred to as a min-M dictionary. The recursion formula

for the best basis search is the same as Eq. (2.6); however, the search for d∗ is now

done over a different set. Specifically, for v − u ≥ 2M ,

O_{u,v} = B_{u,d*} ∪ O_{d*,v}   if C(f, B_{u,d*}) + C(f, O_{d*,v}) < C(f, B_{u,v}),   and   O_{u,v} = B_{u,v}   otherwise,

where

d* = arg min_{d: u+M ≤ d ≤ v−M} [ C(f, B_{u,d}) + C(f, O_{d,v}) ].


(a) “grea” signal. (b) DCT-IV of the “grea” signal. (c) Noisy signal, SNR = 8.22 dB. (d) DCT-IV of the noisy signal. (e) Dyadic LC, SNR = 11.23 dB. (f) Mod-M LC, SNR = 11.62 dB. (g) Dyadic FDLC, SNR = 12.51 dB. (h) Mod-M FDLC, SNR = 12.62 dB. (i) Dyadic LC tiling. (j) Mod-M LC tiling. (k) Dyadic FDLC tiling. (l) Mod-M FDLC tiling.

Fig. 2.8. Best basis thresholding with dyadic and mod-M local co-sine dictionaries in time domain and in frequency domain. The sec-ond row shows various estimates of the signal (a) based on its noisyobservation (c), and the third row shows the corresponding tilingsof the time-frequency plane. From left to right: (e,i) dyadic localcosine dictionary; (f,j) mod-M local cosine dictionary; (g,k) dyadicfrequency-domain local cosine dictionary; (h,l) mod-M frequency-domain local cosine dictionary.


The initial condition is that for v − u < 2M ,

Ou,v = Bu,v.

This method reduces the cost more significantly, but the price to pay is more compu-

tation. The time complexity of this method is O(N³ log N), which is prohibitive for

large signals. However, both this method and the mod-M algorithm of Section 2.2.3

can be used on small blocks of a signal, and the results can be combined via a post-

processing step. By varying the size of each block, we can achieve various tradeoffs

between the cost and the time complexity. The next two subsections describe two

procedures for blockwise application of our algorithms.

2.3.2 Extension 2: Blocks Algorithm

We divide a signal f of length N into blocks of equal size M2. Let L2 = N/M2

be the total number of blocks. For each block flM2,(l+1)M2 , l = 0, 1, . . . , L2 − 1,

the best basis can be calculated using either the mod-M or the min-M algorithm.

Concatenating the partition points of the best bases calculated for all blocks, we get

a partition for the signal:

t0 = 0 < t1 < · · · < tI = N,

where I is the total number of segments. Since we imposed partition points at the

block boundaries, we need a postprocessing step to remove the artifacts and further

optimize the cost by considering the whole signal. The postprocessing will select a

set of partition points among the ones we obtained for the blocks. Let Otu,tv be the

best basis for ftu,tv after postprocessing. Then the postprocessing can be done using

the following recursive formula: when v − u > 1,

O_{t_u,t_v} = B_{t_u,t_{d*}} ∪ O_{t_{d*},t_v}   if C(f, B_{t_u,t_{d*}}) + C(f, O_{t_{d*},t_v}) < C(f, B_{t_u,t_v}),   and   O_{t_u,t_v} = B_{t_u,t_v}   otherwise,   (2.9)

where

d* = arg min_{d: u < d < v} [ C(f, B_{t_u,t_d}) + C(f, O_{t_d,t_v}) ].   (2.10)


When v − u = 1,

Otu,tv = Btu,tv . (2.11)

We now calculate the time complexity of the blocks algorithm where the mod-M

algorithm of Section 2.2.3 is applied to each block. As mentioned in Section 2.2.3,

the calculation of the best basis via the mod-M algorithm is O(L²M₂ log M₂) for each block of size M₂, where L = M₂/M. For L₂ blocks, the total time complexity before postprocessing is therefore O(L₂L²M₂ log M₂) = O(L²N log M₂). The time complexity of the postprocessing step is calculated similarly to that of the mod-M algorithm of Section 2.2.3, and is O(I²N log N), where I is the number of segments before postprocessing. Since, in the worst case, I can be equal to the total number of cells N/M (the quantity denoted by L in Section 2.2.3), the worst-case time complexity of the postprocessing step alone is O((N/M)²N log N), which is the same as the complexity of the mod-M algorithm applied to the whole signal. In practice, however, if appropriate values of M₂ are used, it is typical for I to be significantly smaller than N/M, leading to considerable computational savings. The overall time complexity of the two stages of the blocks algorithm is O(NL² log M₂ + I²N log N) = O(NL² log M₂ + (N/M)²N log N). It can be similarly shown that if the blocks algorithm is used in conjunction with the min-M algorithm of Section 2.3.1, the overall time complexity will be O(NM₂² log M₂ + (N/M)²N log N).

We illustrate the blocks algorithm on the signal “Grea” whose length is N =

4096. We use the mod-M algorithm within each block. We fix the cell size M = 16

and η = 8, and vary the block size M2. The results are shown in Table 2.3(A). The

results of dyadic and mod-M methods with the same M and η are shown in Ta-

ble 2.2 for comparison. By using different values of M2, we are able to obtain various

trade-offs between optimizing the cost and minimizing the running time. When M2

is very large, there is essentially no difference between using blocks and applying

mod-M to the whole signal. In fact, if M2 = N then the two algorithms produce

identical results except the blocks algorithm makes some unnecessary computa-

tions. This is illustrated by comparing the last line of Table 2.3(A) and the last line

of Table 2.2. When M2 is very small, the first stage of the blocks algorithm tends


to produce many partition points, and the bulk of the computation is done during

the postprocessing stage. The algorithm is the fastest for the medium values of M2.

The fact that partition points are imposed at the block boundaries contributes

to the poor performance of the first stage of the blocks algorithm for small block

sizes M2. This problem can be ameliorated by using overlapping blocks, as described

in the next subsection and illustrated in Table 2.3(B).

2.3.3 Extension 3: Overlapping-Blocks Algorithm

We process an N-point signal f using L₂ overlapping blocks which do not necessarily have the same length. We set M₂ = N/L₂. We denote the indices of the leftmost and the rightmost points of the i-th block (i = 0, . . . , L₂ − 1) by l_i and r_i − 1, respectively (i.e., the block itself is denoted by f_{l_i, r_i}). We fix r_i = (i + 1)M₂ and l_0 = 0. The point l_{i+1} and the basis O_{l_i, l_{i+1}} for f_{l_i, l_{i+1}} are recursively found by applying either the mod-M algorithm or the min-M algorithm to f_{l_i, r_i}, subject to the constraint that the first partition point of the best basis is to the right of r_{i−1} − 1. Denoting the optimal partition points of f_{l_i, r_i} by n_0, n_1, . . . , n_{K_i}, we therefore have:

n_0 = l_i < r_{i−1} ≤ n_1 < . . . < n_{K_i−1} < n_{K_i} = r_i.

If i = L₂ − 1, we set l_{i+1} = N; otherwise, we set l_{i+1} = n_{K_i−1}. We let

O_{l_i, l_{i+1}} = B_{l_i, n_1} ∪ B_{n_1, n_2} ∪ . . . ∪ B_{n_{K_i−2}, l_{i+1}}.

Once this is done for all i = 0, 1, . . . , L₂ − 1, we take the overall basis for f to be

O_{l_0, l_1} ∪ O_{l_1, l_2} ∪ . . . ∪ O_{l_{L₂−1}, l_{L₂}}.

We can again use the postprocessing procedure described in Section 2.3.2, Eqs. (2.9-

2.11); however, it may not be needed since the overlapping-blocks algorithm

does not typically result in blocking artifacts.

We illustrate the overlapping-blocks algorithm on the signal “Grea” whose

length is N = 4096. We use the mod-M algorithm within each block. We fix the


cell size M = 16 and η = 8, and vary the block size M2. The results are shown in

Table 2.3(B). The results of dyadic and mod-M methods with the same M and η

are shown in Table 2.2 for comparison.

While the worst-case time complexity of the overlapping-blocks algorithm

can be shown to be the same as that of the mod-M algorithm, we have observed

that, in practice, the overlapping-blocks algorithm can be significantly faster for

appropriate values of M2. The intuition described above for the blocks algorithm

holds here, too, as illustrated in Table 2.3(B): when M2 is very large, the first stage

takes a long time; when M2 is very small, the postprocessing takes a long time; the

algorithm is the fastest for the medium values of M2.

Note that overlapping-blocks is usually faster than blocks, without much

difference in the achieved cost. The reason is that it is able to eliminate more parti-

tion points during stage 1, and therefore its postprocessing stage typically takes less

time. Also note that, for medium and large values of M2, the overlapping-blocks

algorithm does not, in fact, need the postprocessing stage, since postprocessing does

not reduce the cost. Dispensing with the postprocessing stage further reduces the

running time. In addition, this makes it possible to process the data in a sequential

manner: once Oli,li+1is determined, the data for [li, li+1 − 1] can be discarded.

                            time      cost
  dyadic                    0.22 s    4.1
  mod-M, Section 2.2.3      53 s      3.5

Table 2.2. Running times and costs for the dyadic and mod-M algorithms.


Blocks algorithm:
            before postprocessing      after postprocessing
   M₂       time        cost           time        cost
   32       0.16 s      5.43           24 s        3.54
   64       0.19 s      5.11           9.4 s       3.63
   128      0.29 s      4.69           3.9 s       3.70
   256      0.58 s      4.18           2.2 s       3.75
   512      1.3 s       3.81           2.3 s       3.66
   1024     4.2 s       3.75           5.0 s       3.65
   2048     16 s        3.59           17 s        3.59
   4096     54 s        3.51           54 s        3.51

(a) Blocks

Overlapping-blocks algorithm:
            before postprocessing      after postprocessing
   M₂       time        cost           time        cost
   32       0.23 s      4.51           3.3 s       3.68
   64       0.26 s      4.32           2.9 s       3.68
   128      0.37 s      4.14           1.5 s       3.79
   256      0.67 s      3.83           1.7 s       3.80
   512      1.5 s       3.59           2.2 s       3.59
   1024     4.4 s       3.61           5.0 s       3.61
   2048     16 s        3.56           17 s        3.56
   4096     54 s        3.51           55 s        3.51

(b) Overlapping-blocks

Table 2.3. Running times and costs for the (a) blocks algorithm and (b) overlapping-blocks algorithm, each used with the mod-M algorithm.


O_{0,N} ← best_basis(f) {
  for u = N − M, N − 2M, . . . , 2M, M, 0 {
    if u == 0 {
      A ← {0};                                    // special definition for the leftmost endpoint
    }
    for η ∈ A {
      O_{η,u,N} ← B^{η,0}_{u,N};                  // initialize O_{η,u,N}
      C*_{η,u,N} ← C(f, B^{η,0}_{u,N});           // initialize C*_{η,u,N}
      for d = u + M, u + 2M, . . . , N − M {
        for η′ ∈ A {
          if C(f, B^{η,η′}_{u,d}) + C*_{η′,d,N} < C*_{η,u,N} {
            O_{η,u,N} ← B^{η,η′}_{u,d} ∪ O_{η′,d,N};
            C*_{η,u,N} ← C(f, B^{η,η′}_{u,d}) + C*_{η′,d,N};
          }
        }
      }
      save O_{η,u,N} and C*_{η,u,N} in an internal data structure;
    }
  }
  O_{0,N} ← O_{0,0,N};
  return O_{0,N};
}

Fig. 2.9. Pseudocode specification of a fast dynamic programmingalgorithm for the best-basis search in a lapped dictionary.

2.4 Best Basis Search in Lapped Dictionaries

General lapped orthogonal bases [38, 52, 62, 63] are not required to use cosine

functions; they may use a more general family of orthogonal functions which satis-

fies certain symmetry conditions. Moreover, nonsymmmetric windows can be used.

Specifically, a window for a discrete interval [u, v − 1] may be defined as follows:

β^{η,η′}_{u,v}(t) =
    r((t − (u − 1/2))/η)       if u − 1/2 − η ≤ t < u − 1/2 + η,
    1                           if u − 1/2 + η ≤ t < v − 1/2 − η′,
    r(((v − 1/2) − t)/η′)      if v − 1/2 − η′ ≤ t ≤ v − 1/2 + η′,
    0                           otherwise,

where r is a profile function just as in Section 2.2.2, but η is not necessarily equal to

η′. As in Section 2.2.2, special definitions are made for u = 0 and for v = N .


As in Section 2.2.2, we suppose that λ is a set of partition points, but we now assume that each partition point n_k comes with its own profile parameter η_k: λ = {(n_k, η_k)}_{0≤k≤K_λ}. Then, provided that the functions e_{κ,n_k,n_{k+1}} satisfy the appropriate symmetry and orthogonality properties, and that the partition points are such that only adjacent windows overlap (i.e., n_{k+1} − n_k ≥ η_k + η_{k+1} for all k), it can be shown that

B^{η_k,η_{k+1}}_{n_k,n_{k+1}} ≜ { β^{η_k,η_{k+1}}_{n_k,n_{k+1}}(n) e_{κ,n_k,n_{k+1}} }_{0 ≤ κ < n_{k+1}−n_k}

is an orthonormal family, and

B^λ ≜ ⋃_{k=0}^{K_λ−1} B^{η_k,η_{k+1}}_{n_k,n_{k+1}}

is an orthonormal basis for RN [38]. A finite dictionary of such bases may be specified

by allowing the same set of valid partitions as in Eqs. (2.4,2.5) and restricting all

valid profile parameters ηk to a finite set A. By adding the search over the set A,

the mod-M algorithm of Fig. 2.2 is modified to search for the best basis in this

dictionary. The resulting modified algorithm is shown in Fig. 2.9. In this figure,

O_{η,u,N} denotes the best basis associated with the window β^{η,0}_{u,N}, and C*_{η,u,N} denotes the

corresponding cost. This modified algorithm is very generic and can be used to

perform a best-basis search for any mod-M dictionary consisting of a finite number

of lapped orthogonal bases. The extensions of the basic algorithm discussed above

also apply to the generic algorithm. The complexity of the generic algorithm will

depend on the size of the set A and, more generally, on the complexity of calculating

the costs C(f,Bη,η′u,d ).

2.5 Conclusions

We have developed several best basis search algorithms to adaptively compute

local cosine decompositions. Simple examples show that our algorithms yield lower

costs, sparser representations, and better shift-invariance properties than the dyadic

best basis search. In addition, they can better represent important time-frequency


features and be more effective for noise removal. The price we pay is a higher

computational complexity; however, in applications where the speed is important,

it is possible to use accelerated versions of our algorithms by first applying them to

small blocks and then combining the results via a postprocessing step.

We have moreover introduced a new dictionary of frequency-domain local cosines

and showed that it can result in improved representations. We provided a generic

version of our algorithms which can be used to find the best basis in any finite

dictionary of lapped orthogonal bases.


3. FAST SEARCH FOR BEST REPRESENTATIONS IN

MULTITREE DICTIONARIES

3.1 Introduction

A number of research efforts have recently concentrated on developing adaptive

algorithms for representing and approximating signals in overcomplete dictionaries.

This chapter addresses the best basis problem—or, more generally, the best represen-

tation problem: given a signal, a dictionary of representations, and an additive cost

function, the aim is to select the representation from the dictionary which minimizes

the cost for the given signal. This paradigm has been successfully used for problems

in compression [31–34], estimation [35–43], and time-frequency (or space-frequency)

analysis [4, 44–48,64,65].

The original papers on best basis search [2, 3] considered the wavelet packet

bases [3] and bases of local cosines [49–52] on dyadic intervals. In each of these two

cases, all the bases in the dictionary can be organized using a single tree: a binary

tree in 1-D and a quadtree in 2-D. This organization was exploited in [2,3] to devise

a fast recursive tree pruning algorithm to find the best basis for any additive cost

function.

Since then, a number of efforts have sought to lift the restrictions that a fixed

binary/quadtree structure imposes on the underlying dictionary. Search methods for

various dictionaries that correspond to different sets of possible time-frequency or

space-frequency tilings have been proposed, such as the double-tree algorithm [44],

time-frequency trees [47,48], space-frequency trees [45], adaptive Haar-Walsh tilings

[33], anisotropic wavelet packets [30, 40], anisotropic cosine packets [30], and mixed

isotropic/anisotropic packets [30].


The main contributions of this chapter are:

• a new framework of multitree dictionaries which includes some previously pro-

posed dictionaries as special cases;

• a fast recursive algorithm to find the best representation of data in a multitree

dictionary;

• several application examples, including a novel image coder, which typically

reduces the bit rate by about 25-40% compared to JPEG and by about 10-20%

compared to the quadtree-based approach of [31], and whose rate-distortion

performance is comparable to that of embedded wavelet coders such as JPEG-

2000 and SPIHT.

We start our discussion in Section 3.2 with a simple example of an optimal rect-

angular tiling algorithm. A simple modification of this algorithm leads to a best

wedgelet algorithm for arbitrary rectangular tilings which we present in Section 3.3.

Two further extensions of our basic tiling algorithm are described in Section 3.4. Sec-

tion 3.5 applies our algorithm to the problem of image compression. In Section 3.6,

we introduce the general framework of multitree dictionaries, and argue that the

algorithms of Sections 3.2, 3.3, and 3.4 are special cases of a general recursive algo-

rithm for finding the best object in a multitree dictionary. In Section 3.7, we then

discuss relationships of our framework and algorithms to previously proposed best

basis algorithms, and to other application areas.

3.2 Example 1: Optimal Rectangular Tilings

3.2.1 A Fast Recursive Tiling Algorithm

We consider all images supported on a discrete rectangular domain Q ⊂ Z2.

Suppose we are given an image f and would like to segment it into rectangular tiles



Fig. 3.1. An illustration of tilings and sequences of splits. (a) Anadmissible tiling—i.e., a tiling that can be obtained via a sequenceof binary splits. (b) An inadmissible tiling. (c) A sequence of splitsthat leads to the tiling in (a). (d) Another sequence of splits thatleads to the tiling in (a).

P1, P2, . . . , Pd so as to minimize a cost which is equal to the sum of the costs of the

individual tiles:

Σ_{i=1}^{d} e(P_i),   (3.1)

where e is a cost function which is application specific and which depends on the

image f .

We restrict our choice of tilings, and only consider those tilings that can be

obtained through the following recursive binary splitting process:

• start with a tiling which consists of a single tile—namely, the whole image

domain;

• for every tile in the tiling which consists of more than one pixel,

either keep it and do not split it ever again,


or split it into two smaller rectangular tiles;

• continue until all the tiles in the tiling either consist of one pixel or are labeled

“never split again”.

A rectangular tiling which can be obtained through this procedure is called an ad-

missible tiling. An admissible tiling is illustrated in Fig. 3.1(a). The rectangular

tiling depicted in Fig. 3.1(b) cannot be obtained through the binary splitting pro-

cess described above, even though every tile in the tiling is a rectangle. This tiling

is therefore not an admissible tiling.

The binary splitting process is conveniently visualized as a tree, with every node

of the tree corresponding to a unique rectangular region of the image, as shown in

Fig. 3.1(c).1 We therefore use the terms node and rectangular region interchangeably.

In particular, the entire image domain corresponds to the root of the tree. The yield

of the binary tree—i.e., the set of all leaves—is then a tiling of the image. We

therefore use the terms leaf node and tile interchangeably. The set of all such trees

will give us the set of all admissible tilings (however, several different trees may

correspond to the same tiling, as shown in Fig. 3.1(c,d)).

To efficiently solve our optimal tiling problem, we assign the cost given in Eq. (3.1)

to every tree t whose yield is an admissible tiling {P1, . . . , Pd}:

cost0(t) =∑

P∈yield(t)

e(P ). (3.2)

We then search over all trees to find one of the trees with the smallest cost. The

optimal tiling is then the yield of this tree. Since our search space consists of multiple

trees, we call it a multitree dictionary. Our efficient search algorithm exploits the

fact that although the number of possible trees is very large [38, 66], the number of

rectangular tiles is much smaller and manageable.

To describe our search algorithm, let C∗P be the cost of the optimal tiling for a

rectangle P. In particular, C*_Q = min_t cost₀(t) is the optimal cost for the entire

1In the figure, a short vertical (horizontal) line through a node signifies a vertical (horizontal) split.


image domain Q. Our algorithm makes the following recursive call, starting with

P = Q:

C∗P = min{e(P ), min(C∗P ′ + C∗P ′′)}, (3.3)

where the inner minimization is done over all ordered pairs of rectangles (P ′, P ′′)

which partition the rectangle P :

P = P ′ ∪ P ′′ and P ′ ∩ P ′′ = ∅.

We always assume that, if the split is horizontal, then P ′ is on top of P ′′, and if the

split is vertical, then P ′ is to the left of P ′′.

The recursive call (3.3) terminates at the pixels:

if P is a pixel, then C∗P = e(P ). (3.4)

To avoid repetitive calculation, we store the optimal cost and the optimal split

for each rectangle in a table. Before making a recursive call for any rectangle P ,

the table is consulted to make sure that P has not been visited before. If the

original image domain is N₁ × N₂, it has O(N₁²N₂²) different subrectangles, and therefore maintaining the table requires O(N₁²N₂²) memory. With this table, we only need to make one recursive call per rectangle. Since each recursive call involves O(N₁ + N₂) comparisons to calculate C*_P via Eq. (3.3)—corresponding to N₁ − 1 horizontal splits and N₂ − 1 vertical splits—the computational complexity of the search algorithm is O(N₁²N₂²(N₁ + N₂)), which is O(N^2.5) for a square image with N pixels, N₁ = N₂ = √N.

The pseudocode for the search algorithm is shown in Fig. 3.2. The optimal left

child of P is denoted by s∗P , and the optimal overall tiling by B∗P . Fig. 3.2(a) shows the

pseudocode for the recursive calculation of the optimal splits and corresponding costs

which are stored in a global data structure Table. Once this piece of pseudocode

is executed, the optimal tiling is constructed using the routine in Fig. 3.2(b) which

is assumed to have access to the same global data structure Table. Specifically,

(C∗P, s∗P) ← best split v0(P) {
    if C∗P has been computed
        get C∗P and s∗P from the global data structure Table;
    else {
        s∗P ← ∅;                  // Initialize best left child s∗P
        C∗P ← e(P);               // Initialize best cost C∗P
        for (P′, P′′) = a partition of P into two rectangles {
            (C∗P′, s∗P′) ← best split v0(P′);
            (C∗P′′, s∗P′′) ← best split v0(P′′);
            if C∗P′ + C∗P′′ < C∗P {
                s∗P ← P′;             // Update s∗P
                C∗P ← C∗P′ + C∗P′′;   // Update C∗P
            }
        }
        record C∗P and s∗P in the global data structure Table;
    }
    return C∗P and s∗P;
}

(a) Recursive calculation of the optimal splits and corresponding costs.

B∗P ← best tiling v0(P) {
    get s∗P from the global data structure Table;
    if s∗P is the empty set
        B∗P ← {P};
    else
        B∗P ← best tiling v0(s∗P) ∪ best tiling v0(P \ s∗P);
    return B∗P;
}

(b) Recursive generation of the best tiling.

Fig. 3.2. Pseudocode specification of a fast recursive search for the best rectangular tiling: (a) the recursive calculation of the optimal left children s∗P and the corresponding costs C∗P; (b) the recursive generation of the best tiling. It is assumed that both routines have access to the same global data structure Table. The optimal tiling B∗Q of an image domain Q is obtained with (C∗Q, s∗Q) ← best split v0(Q), followed by B∗Q ← best tiling v0(Q).

the optimal tiling B∗Q of an image domain Q is obtained with the following two

commands:

(C∗Q, s∗Q) ← best split v0(Q),

B∗Q ← best tiling v0(Q),


which call the two routines in Fig. 3.2.
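As a concrete illustration, the following is a minimal Python sketch of the two routines of Fig. 3.2; the rectangle representation (corner coordinates), the function names, and the use of memoization in place of the explicit Table are illustrative assumptions rather than the implementation used for the experiments in this chapter.

from functools import lru_cache

def optimal_tiling(N1, N2, e):
    """Minimum-cost admissible rectangular tiling of an N1 x N2 domain.
    `e` maps a rectangle (r0, c0, r1, c1), with inclusive top-left corner and
    exclusive bottom-right corner, to its cost e(P)."""

    @lru_cache(maxsize=None)                      # plays the role of the global Table
    def best_split(r0, c0, r1, c1):
        best_cost, best_child = e((r0, c0, r1, c1)), None     # the "do not split" option
        for r in range(r0 + 1, r1):                            # horizontal splits
            cost = best_split(r0, c0, r, c1)[0] + best_split(r, c0, r1, c1)[0]
            if cost < best_cost:
                best_cost, best_child = cost, ('h', r)
        for c in range(c0 + 1, c1):                            # vertical splits
            cost = best_split(r0, c0, r1, c)[0] + best_split(r0, c, r1, c1)[0]
            if cost < best_cost:
                best_cost, best_child = cost, ('v', c)
        return best_cost, best_child

    def best_tiling(r0, c0, r1, c1):
        _, split = best_split(r0, c0, r1, c1)
        if split is None:
            return [(r0, c0, r1, c1)]                          # this rectangle is a tile
        kind, pos = split
        if kind == 'h':
            return best_tiling(r0, c0, pos, c1) + best_tiling(pos, c0, r1, c1)
        return best_tiling(r0, c0, r1, pos) + best_tiling(r0, pos, r1, c1)

    return best_split(0, 0, N1, N2)[0], best_tiling(0, 0, N1, N2)

The memoization decorator ensures that each subrectangle is visited only once, exactly as the global data structure Table does in Fig. 3.2.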

3.2.2 A Simple Cost Function

The preceding discussion supposes that the individual costs e(P ) have been pre-

computed for every rectangle P . We analyze this computation using the following

simple cost:

e(P) = ∑_{(n1,n2)∈P} (f(n1, n2) − fP)² + w,    (3.5)

which results in the following overall cost of a tiling {P1, . . . , Pd}:

∑_{i=1}^{d} ∑_{(n1,n2)∈Pi} (f(n1, n2) − fPi)² + wd,    (3.6)

where

f(n1, n2) is the pixel value at the location (n1, n2);

fPi is the average of the image f over the rectangle Pi;

d is the number of tiles in the tiling;

w is an application-specific penalty on the number of tiles (such as, e.g., the

average coding complexity in a compression application).

For this particular cost function (3.5), computing e(P ) for every rectangle P can

be done very efficiently by defining the following two statistics:

ρ1(f, P) = ∑_{(n1,n2)∈P} f(n1, n2) = |P| fP,

ρ2(f, P) = ∑_{(n1,n2)∈P} f(n1, n2)²,

and noticing that, if we know these two statistics for a pair of rectangles (P′, P′′) which partition a rectangle P, we can calculate e(P) in O(1) time as follows:

ρ1(f, P) = ρ1(f, P′) + ρ1(f, P′′),

ρ2(f, P) = ρ2(f, P′) + ρ2(f, P′′),

e(P) = ρ2(f, P) − ρ1²(f, P)/|P| + w.

This is used to compute all the costs in a bottom-up fashion, with both time and space complexity O(N1²N2²).
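As a rough Python sketch of this precomputation (an illustrative companion to the search sketch given earlier, with f assumed to be a 2-D NumPy array and the same rectangle convention), one might write:

from functools import lru_cache
import numpy as np

def precompute_costs(f, w):
    """Table of e(P) = rho2(f,P) - rho1(f,P)^2/|P| + w for all subrectangles of f,
    accumulating rho1 and rho2 of a rectangle from those of two halves."""
    N1, N2 = f.shape

    @lru_cache(maxsize=None)
    def stats(r0, c0, r1, c1):
        if r1 - r0 == 1 and c1 - c0 == 1:          # single pixel
            v = float(f[r0, c0])
            return v, v * v
        if r1 - r0 > 1:                            # split into two halves
            m = (r0 + r1) // 2
            a, b = stats(r0, c0, m, c1), stats(m, c0, r1, c1)
        else:
            m = (c0 + c1) // 2
            a, b = stats(r0, c0, r1, m), stats(r0, m, r1, c1)
        return a[0] + b[0], a[1] + b[1]            # rho1 and rho2 are additive

    costs = {}
    for r0 in range(N1):
        for r1 in range(r0 + 1, N1 + 1):
            for c0 in range(N2):
                for c1 in range(c0 + 1, N2 + 1):
                    rho1, rho2 = stats(r0, c0, r1, c1)
                    area = (r1 - r0) * (c1 - c0)
                    costs[(r0, c0, r1, c1)] = rho2 - rho1 * rho1 / area + w
    return costs

The resulting table can then serve as the cost function of the earlier sketch, e.g., via e = lambda P: costs[P].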

3.2.3 Reducing the Computational Complexity

The overall time complexity of the optimal tiling algorithm with the cost (3.5)—i.e., the computation of the costs and the recursive search combined—is O(N1²N2²(N1 + N2)). The overall space complexity is O(N1²N2²).

Note that reducing the number of admissible rectangular tilings may result in a

lower computational complexity of the algorithm. For example, we can restrict the

search space if we only allow a rectangle to be split into two congruent rectangles, as

was done in, e.g., [40]. In other words, we can impose that during our recursive binary

splitting process, an n1 × n2 rectangle may only be split either into two n1/2 × n2

rectangles, or into two n1 × n2/2 rectangles. This “dyadic tiling” scenario is called

“dyadic CART” in [40] and is similar to the anisotropic wavelet packets [30, 40].2

It can be shown that in this case, the total number of possible rectangular tiles is

O(N1N2), and therefore the computation of the costs has time and space complexity

O(N1N2). The minimization in Eq. (3.3) is O(1) since it now involves choosing

one of no more than three options: horizontal split or vertical split or no split.

Therefore, both the time and space complexity of the search is O(N1N2), which is

also the overall complexity of the algorithm—i.e., the computation of the costs and

the recursive search combined. In this case, the complexity is linear in the number

of pixels.
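In terms of the Python sketch given earlier, this restriction simply replaces the loops over all interior split positions by at most two midpoint splits; a hedged fragment (assuming side lengths that remain even while they exceed one pixel):

def dyadic_splits(r0, c0, r1, c1):
    """Candidate splits when a rectangle may only be divided into two congruent halves."""
    splits = []
    if r1 - r0 > 1:
        splits.append(('h', (r0 + r1) // 2))   # two (n1/2) x n2 rectangles
    if c1 - c0 > 1:
        splits.append(('v', (c0 + c1) // 2))   # two n1 x (n2/2) rectangles
    return splits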

Another way of reducing the computation time and memory requirements is

restricting the split locations to only occur at multiples of some integer M > 1. In

this case, the elementary cells in the resulting tilings will be M×M rectangles rather

than single pixels. Our rectangular tiling algorithms, with M = 16, are illustrated

2The scenario which is similar to the classical wavelet packets results from imposing that, furthermore, any horizontal

split must be followed by a vertical one, and vice versa. In other words, if an n1 × n2 rectangle resulted from a

horizontal split, it is only allowed to be split into two n1 × n2/2 rectangles; and if it resulted from a vertical split,

it is only allowed to be split into two n1/2× n2 rectangles.


(a) Cameraman image. (b) Best dyadic tiling, cost 0.57 (c) Best arbitrary tiling, cost 0.44

Fig. 3.3. A 256 × 256 cameraman image and its best rectangular tilings with the smallest cell size 16 × 16: (b) best dyadic tiling, cost 0.57; (c) best arbitrary tiling, cost 0.44.

in Fig. 3.3: Fig. 3.3(b) shows the result of the dyadic search, and Fig. 3.3(c) shows

the result of the full search.

We also note that for any set of admissible tilings, a further reduction in compu-

tational complexity can be achieved by sacrificing optimality and using a suboptimal,

greedy search method proposed in, e.g., [67, 68].

The problems addressed in the remainder of the chapter exemplify many situa-

tions where the computation of the costs may be more complex than O(1) per pixel

and in fact may dominate the computational complexity of the overall algorithm.

3.3 Example 2: Optimal Wedgelet Tilings

3.3.1 Algorithm Extension 1: State Variables

In the best wedgelet algorithm [42], each tile can be represented using one of

several wedgelets. In our image coding algorithm in Section 3.5, we will allow the

choice of several quantizers for encoding each tile. To model these choices, we intro-

duce the concept of a state variable. To every tile P , we associate a state variable

xP taking values in some finite set which, without loss of generality, we assume to be

{1, 2, . . . , X} where X is some fixed integer. Each term of the cost function is now



Fig. 3.4. A wedgelet.


(a) Quadtree wedgelets. (b) Dyadic wedgelets. (c) Rate-distortion curves.

Fig. 3.5. Two best wedgelet tiling examples for a 128 × 128 binary image: (a) Quadtree wedgelets, SNR = 17.1 dB, rate = 0.0062 bits per pixel; (b) Dyadic wedgelets, SNR = 17.8 dB at 0.0055 bits per pixel. Panel (c) shows the rate-distortion curves for this image, for the quadtree wedgelets (dashed) and the dyadic wedgelets (solid).

allowed to depend on the corresponding state variable—in other words, we replace

the cost given in Eq. (3.2) with the following:

cost1(t) = ∑_{P∈yield(t)} c(P, xP).    (3.7)

Note that if we let e(P) = min_{xP} c(P, xP), this cost becomes the same as cost0 in

Eq. (3.2). Therefore, the search for the best tree and the best tiling now consists of

two steps: finding the best state for each tile P via minimizing c(P, xP ) with respect

to xP , and then applying our recursive algorithm of Fig. 3.2(a).


3.3.2 Wedgelet Experiments

A wedgelet [42] is an image defined on a rectangular domain and consisting of

two constant pieces which are joined together along a straight line, as illustrated

in Fig. 3.4. We can represent a wedgelet on a domain P as a quadruple xP =

(P ′, P ′′, µ′, µ′′) where P ′ and P ′′ are the two regions that the straight line partitions

P into, and µ′ and µ′′ are the respective image intensities. Alternatively, P ′ and P ′′

can be specified by the two endpoints of the line. It is typically assumed that the

endpoints are restricted to a grid with some small step ∆, as shown in Fig. 3.4.

Given an image f , we can approximate the image values over a rectangular

domain P with a wedgelet xP = (P ′, P ′′, fP ′ , fP ′′) where fP ′ and fP ′′ are the average

intensities of f over the regions P ′ and P ′′, respectively. We penalize any such

approximation using the following simple cost function which is similar to Eq. (3.5):

c(P, xP) = ∑_{(n1,n2)∈P′} (f(n1, n2) − fP′)² + ∑_{(n1,n2)∈P′′} (f(n1, n2) − fP′′)² + 2w.

In addition, we still allow approximating an image tile with a constant, and still use

the cost in Eq. (3.5) in this case.
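As a rough illustration of this cost, the following Python fragment evaluates c(P, xP) for one candidate dividing line; the rectangle convention, the side test, and the omission of the endpoint grid with step ∆ are illustrative simplifications, not the implementation used for the experiments below.

import numpy as np

def wedgelet_cost(f, rect, p_a, p_b, w):
    """Cost of approximating f over rect = (r0, c0, r1, c1) by the wedgelet whose
    dividing line passes through the boundary points p_a and p_b (row, column)."""
    r0, c0, r1, c1 = rect
    rows, cols = np.mgrid[r0:r1, c0:c1]
    # The sign of the cross product tells on which side of the line a pixel lies.
    side = (p_b[0] - p_a[0]) * (cols - p_a[1]) - (p_b[1] - p_a[1]) * (rows - p_a[0]) >= 0
    patch = f[r0:r1, c0:c1]
    cost = 2 * w
    for mask in (side, ~side):                 # the two regions P' and P''
        if mask.any():
            vals = patch[mask]
            cost += float(((vals - vals.mean()) ** 2).sum())
    return cost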

Our fast search algorithm can then find the optimal wedgelet tiling. Fig. 3.5

depicts some examples for a binary image. Fig. 3.5(a) shows the best quadtree

wedgelet tiling. This strategy was proposed in the original wedgelet paper [42]. Al-

lowing more possibilities for split locations leads to more compact and more precise

wedgelet tilings. The best dyadic wedgelet tiling is shown in Fig. 3.5(b) and al-

lows each rectangle to be split into two congruent rectangles either horizontally or

vertically.

We assumed the following simple approximation for the number of bits required

to encode our wedgelet tilings:

• one bit per node to encode whether it is an internal node or a leaf;

• one bit per leaf node to encode whether it is a constant tile or a wedgelet;


• one bit per leaf node to encode the intensity (this is a reasonable approximation,

since our input image is binary);

• log2(((M + N)/∆)²) bits per wedgelet leaf node of size M × N , to encode the

position of the wedgelet partition;

• in addition, for dyadic wedgelet tilings, we spend one bit per internal node to

encode whether it is split horizontally or vertically.
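Under these assumptions, the bit budget of a given wedgelet tiling can be tallied with a small helper along the following lines (the function and argument names are illustrative, not part of the original experiments):

import math

def wedgelet_rate_bits(num_internal, num_leaves, wedgelet_sizes, delta, dyadic=False):
    """Approximate number of bits to encode a wedgelet tiling, following the
    itemized accounting above; wedgelet_sizes lists (M, N) for the wedgelet leaves."""
    bits = num_internal + num_leaves           # internal-node/leaf flag, one bit per node
    bits += num_leaves                         # constant-tile vs. wedgelet flag per leaf
    bits += num_leaves                         # one bit per leaf for the (binary) intensity
    for M, N in wedgelet_sizes:                # position of the wedgelet partition
        bits += math.log2(((M + N) / delta) ** 2)
    if dyadic:
        bits += num_internal                   # horizontal-vs-vertical flag per internal node
    return bits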

With these assumptions, the quadtree tiling of Fig. 3.5(a) produces SNR of 17.1

dB and rate 0.0062 bits per pixel, whereas Fig. 3.5(b) has both a higher SNR of

17.8 dB and a lower rate of 0.0055 bits per pixel. Note also that the quadtree tiling

has 16 tiles whereas the dyadic tiling has only eight tiles. Dyadic tilings outperform

quadtree tilings, achieving lower rates at the same SNR’s and higher SNR’s at the

same rates for this image, as shown in Fig. 3.5(c). The curves in Fig. 3.5(c) were

obtained by varying the split penalty w.

3.4 Further Extensions of the Optimal Tiling Algorithm

3.4.1 Algorithm Extension 2: Incorporating Internal Nodes into the

Cost

Recall that in previous sections, the trees played an auxiliary role since the cost

only depended on the yield of the tree—i.e., the leaf nodes—but was independent of

the internal nodes of the tree. However, in some applications the internal structure

of the tree matters. For example, in the wedgelet experiments of the previous section

as well as in the compression experiments which will be discussed in Section 3.5, the

structure of the tree must be encoded, and the encoding costs may be different for

two different trees which correspond to the same tiling. We would like to be able

to include these costs in the cost function optimized by our algorithm. To model

this and a variety of other such situations where the internal structure of the tree is

important, we now equip every node P with a state xP, and use a cost function c to penalize the split of a node P with a state xP into nodes P′ and P′′ with states xP′ and xP′′, respectively. Our new cost for any tree t is:

cost2(t) = ∑_{P∈internal-nodes(t)} c((P, xP) → (P′, xP′) (P′′, xP′′)) + ∑_{P∈yield(t)} c(P, xP),    (3.8)

where

in the first summation, the nodes P ′ and P ′′ are the children of the node P on

the tree t;

xP , xP ′ , and xP ′′ are the state variables associated with the nodes P , P ′, and

P ′′, respectively;

c(P, xP) and c((P, xP) → (P′, xP′) (P′′, xP′′)) are application-specific cost functions which penalize, respectively, keeping a node as a leaf and splitting it.

Note that this cost is a generalization of cost1(t) in Eq. (3.7). Indeed, if we set

c ≡ 0, then cost2(t) = cost1(t). Note also that, in the cost (3.5,3.6) which we used

in our tiling experiments, the penalty w can be interpreted as a split cost function c

which assigns a constant penalty w to each split.

We let C∗P,x be the cost of the optimal tree for a rectangle P , given xP = x, and

we let C∗P be the cost of the overall optimal tree for P, i.e., C∗P = min_x C∗P,x. The

optimal tree is found using the following recursion:

C∗P,x = c(P, x),   if P is an elementary cell;

otherwise,

C∗P,x = min{ c(P, x),  min_{P′,P′′,x′,x′′} [ c((P, x) → (P′, x′) (P′′, x′′)) + C∗P′,x′ + C∗P′′,x′′ ] }.    (3.9)

This recursion is similar to Eqs. (3.3,3.4) and can therefore be implemented using

the pseudocode in Figs. 3.6 and 3.7 which are extensions of Figs. 3.2(a) and 3.2(b),

respectively.


(C∗P,x, s∗P,x) ← best split v2(P, x) {
    if C∗P,x has been computed
        get C∗P,x and s∗P,x from the global data structure Table;
    else {
        // Initialize
        s∗P,x ← ((∅, 0), (∅, 0));
        C∗P,x ← c(P, x);
        for x′ = 1 : X, x′′ = 1 : X, (P′, P′′) = a partition of P into two rectangles {
            (C∗P′,x′, s∗P′,x′) ← best split v2(P′, x′);
            (C∗P′′,x′′, s∗P′′,x′′) ← best split v2(P′′, x′′);
            if C∗P′,x′ + C∗P′′,x′′ + c((P, x) → (P′, x′) (P′′, x′′)) < C∗P,x {
                // Update
                s∗P,x ← ((P′, x′), (P′′, x′′));
                C∗P,x ← C∗P′,x′ + C∗P′′,x′′ + c((P, x) → (P′, x′) (P′′, x′′));
            }
        }
        record C∗P,x and s∗P,x in the global data structure Table;
    }
    return C∗P,x and s∗P,x;
}

Fig. 3.6. Pseudocode for the recursive calculation of the optimal splits and states and the corresponding costs for cost2 of Section 3.4.1.

t∗P,x ← best tree v2(P, x) {
    get s∗P,x ≡ ((P′, x′), (P′′, x′′)) from the global data structure Table;
    if P′ is the empty set
        t∗P,x ← [(P, x)];
    else
        t∗P,x ← the tree with root (P, x), left subtree best tree v2(P′, x′), and right subtree best tree v2(P′′, x′′);
    return t∗P,x;
}

Fig. 3.7. Pseudocode for the recursive generation of the best tree for Section 3.4.1.


3.4.2 Algorithm Extension 3: Dynamic Programming Over a Sequence

of Blocks

If an image is partitioned into K blocks Q1, Q2, . . . , QK—as in, for example,

JPEG and [31]—our algorithm can be used to find the optimal tiling within each

block. In [31], it was assumed that each block is handled independently. However, as

argued in [69, 70], it is sometimes advantageous to assume that pairs of consecutive

blocks are interdependent. In order to model this new assumption, we let t1, . . . , tK

be the trees corresponding to the blocks Q1, . . . , QK , respectively, and assign the

following cost to this collection of trees {t1, . . . , tK}:

cost-blocks(t1, . . . , tK) = ∑_{k=2}^{K} c̄(Qk, xQk, Qk−1, xQk−1) + ∑_{k=1}^{K} cost2(tk).    (3.10)

Let C̄∗1:i,x be the optimal cost for i blocks, given that xQi = x. In other words, C̄∗1:i,x is defined as the result of minimizing cost-blocks(t1, . . . , ti) subject to xQi = x. Then we have the following recursion for C̄∗1:i,x:

C̄∗1:i,x = C∗Q1,x,   for i = 1;

C̄∗1:i,x = min_{x′} ( c̄(Qi, x, Qi−1, x′) + C∗Qi,x + C̄∗1:i−1,x′ ),   for i = 2, . . . , K,    (3.11)

where C∗Qi,x is computed through the recursion (3.9), using the pseudocode in Fig. 3.6. The overall optimal cost, which we denote C̄∗1:K, is found from:

C̄∗1:K = min_x C̄∗1:K,x.

This recursive calculation is performed using the dynamic programming algorithm

of Fig. 3.8, similar to those used in [69,70].
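For illustration, the forward sweep and backtracking of this dynamic program can be written compactly as below; the sketch assumes that the per-block costs C∗Qi,x and the inter-block coupling costs have already been tabulated as arrays, which is an illustrative simplification of Fig. 3.8.

import numpy as np

def best_state_sequence(block_costs, coupling_costs):
    """Dynamic programming over a sequence of K blocks with X states each.

    block_costs:    K x X array; block_costs[i, x] plays the role of C*_{Q_i, x}.
    coupling_costs: (K-1) x X x X array; coupling_costs[i-1, x, xp] plays the role
                    of the inter-block cost for state x of block i and state xp of block i-1.
    Returns the optimal total cost and the optimal state for each block."""
    K, X = block_costs.shape
    C = np.empty((K, X))                  # C[i, x] corresponds to the optimal cost for blocks 1..i
    back = np.zeros((K, X), dtype=int)

    C[0] = block_costs[0]                 # the i = 1 case of the recursion
    for i in range(1, K):                 # forward sweep, i = 2, ..., K
        totals = coupling_costs[i - 1] + C[i - 1][None, :]   # minimize over previous state
        back[i] = totals.argmin(axis=1)
        C[i] = block_costs[i] + totals.min(axis=1)

    # Backtracking
    states = np.empty(K, dtype=int)
    states[-1] = int(C[-1].argmin())
    for i in range(K - 1, 0, -1):
        states[i - 1] = back[i, states[i]]
    return float(C[-1].min()), states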

3.5 Example 3: Multitree Image Coding Algorithm

We fuse our rectangular tiling algorithm with several aspects of the compression

strategy in [31], to obtain an image coder which finds the optimal tiling, and encodes

every tile. The input is partitioned into blocks Q1, . . . , QK , in the raster order.

Within each block, we find the optimal tree t∗k and encode it as follows:


(t∗1, . . . , t∗K) ← best tree sequence(Q1, . . . , QK) {
    // Initialization
    for x = 1 : X, P = Q1 : QK
        (C∗P,x, s∗P,x) ← best split v2(P, x);
    for x = 1 : X {
        C̄∗1:1,x ← C∗Q1,x;
        optimal previous state1:1,x ← 0;
    }
    // Forward sweep
    for i = 2 : K, x = 1 : X {
        C̄∗1:i,x ← min_{x′} ( c̄(Qi, x, Qi−1, x′) + C∗Qi,x + C̄∗1:i−1,x′ );
        optimal previous state1:i,x ← arg min_{x′} ( c̄(Qi, x, Qi−1, x′) + C∗Qi,x + C̄∗1:i−1,x′ );
    }
    // Backtracking
    x∗ ← arg min_x C̄∗1:K,x;
    for i = K : −1 : 1 {
        t∗i ← best tree v2(Qi, x∗);
        x∗ ← optimal previous state1:i,x∗;
    }
    return t∗1, . . . , t∗K;
}

Fig. 3.8. Pseudocode for the dynamic programming over blocks, Section 3.4.2.

• one bit per node is used to indicate whether the node is an internal node or a

leaf;

• for each node with a state x ∈ {1, . . . , X}, we use ⌈log2 X⌉ bits to encode the

state x;

• ⌈log2 splitsP⌉ bits are used to encode the split location for every internal node

P , where splitsP is the total number of possible split locations for the node

P .

To find the optimal tree, we optimize with respect to the rate-distortion cost [31]

D + λR, where R is the number of bits it takes to encode the image, D is the total

distortion, and λ is a parameter. We assume that the distortion D is additive over the

tiles and over the blocks. In our experiments, we use the sum of squared differences

as our distortion criterion. For each tile, we follow a JPEG-like procedure which

finds the DCT coefficients, quantizes them, and entropy-codes the AC coefficients


and differential DC coefficients. The DC coefficients are differentially coded in the

following manner:

• the root DC coefficient for the first block Q1 is encoded;

• the difference between the root DC coefficients for the k-th block and the

(k − 1)-st block is encoded, for k = 2, . . . , K;

• for every leaf node P of every tree t∗k, the difference between the DC coefficient

for P and the root DC coefficient is encoded.
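The differential DC scheme just described amounts to emitting the following symbols (a hypothetical helper, shown only to make the three rules explicit; it is not the actual coder):

def dc_differences(root_dc, leaf_dcs):
    """Differential coding of DC coefficients over a sequence of blocks.

    root_dc:  list of root DC coefficients, one per block Q_1, ..., Q_K.
    leaf_dcs: list (one entry per block) of the DC coefficients of that block's leaf tiles.
    Returns the list of values that would be passed to the entropy coder."""
    symbols = [root_dc[0]]                                   # root DC of the first block
    for k in range(1, len(root_dc)):
        symbols.append(root_dc[k] - root_dc[k - 1])          # inter-block difference
    for k, leaves in enumerate(leaf_dcs):
        for dc in leaves:
            symbols.append(dc - root_dc[k])                  # leaf DC minus its block's root DC
    return symbols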

Following [31], we assume that one of several quantizers can be used for each tile,

and optimize our choice of the quantizer for each tile concurrently with the search

for the optimal tiling. The state xP corresponds to the quantizer used for the tile

P . In addition, we allow the choice of the same set of quantizers to encode the root

DC coefficient.

Because of the differential coding of the DC coefficients, the bit rate within each

block can be shown to have the form of Eq. (3.8), and the overall bit rate is additive

over pairs of consecutive blocks and is therefore of the form (3.10). This, combined

with the additivity of the distortion, means that the overall cost D + λR is of the

form (3.10). This means that, in order to optimize it, we can use the algorithm of

Section 3.4.2 and Fig. 3.8.

In order to minimize the distortion subject to a fixed rate, or to minimize the

rate subject to a fixed distortion, our optimization algorithm can be used within an

iterative procedure similar to that of [31].

3.5.1 Compression Experiments

We compare our multitree-JPEG compression algorithm with standard JPEG

and with the quadtree-based algorithm of [31].3 We test the algorithms on four

3The rate-distortion curves we obtain for the JPEG and quadtree algorithms are different from those given in [31] since we use a somewhat different implementation—for example, we use a different set of quantization matrices. However, the relative improvement of the quadtree algorithm over JPEG that we observe is similar to what is reported in [31].


Fig. 3.9. Rate-Distortion curves for “goldhill” (top left), “barbara” (top right), “lenna” (bottom left), and “cameraman” (bottom right).

images: a 512×512 image “barbara”, and three 256×256 images “goldhill,” “lenna,”

and “cameraman”. The corresponding sets of rate-distortion curves are shown in

Fig. 3.9. In each figure, the rate in bits per pixel is plotted against the peak signal-to-

noise ratio (PSNR). For each quadtree and multitree experiment, a target distortion

was fixed, and the rate was minimized. Note that our multitree algorithm (solid)

outperforms the standard JPEG (dash) by about 2-4 dB and the quadtree algorithm

(dashdot) by about 1-2 dB at a fixed bit rate. Equivalently, the multitree algorithm

represents compression savings of about 25-40% over the standard JPEG and 10-20%

over the quadtree algorithm, for a fixed PSNR.


In these experiments, we take the block size to be 16×16 and we take the smallest

cell size to be 4× 4—i.e., we allow rectangular tiles with sides 4, 8, 12, and 16. This

means that, for each 16 × 16 block, we search over 68480 distinct tilings—this is

in contrast to the quadtree method which only allows 17 distinct tilings, and the

standard JPEG which only considers one tiling. While the number of possible tilings

for our method is drastically larger, the number of distinct subrectangles of each

block—which is what determines the computational complexity of our algorithm—is

only 100, compared to 21 for the quadtree method and 4 for the standard JPEG.

Thus, we are able to search over a much larger set with only a modest increase in

the computational burden. It can be shown that the increase in the allowed number

of tilings is exponential as compared to the quadtree algorithm whereas the increase

in the computational burden is only polynomial.
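As a quick check of the subrectangle count quoted above, one can enumerate the allowed rectangles of a 16 × 16 block when split locations are restricted to multiples of 4 (boundary positions 0, 4, 8, 12, 16):

from itertools import combinations

boundaries = range(0, 17, 4)                  # allowed split positions along one dimension
spans = list(combinations(boundaries, 2))     # (start, end) pairs with start < end
print(len(spans) ** 2)                        # 10 * 10 = 100 distinct subrectangles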

The results for the “barbara” image at PSNR = 36.4 dB are given in Fig. 3.10: the

JPEG, quadtree, and multitree compression algorithms achieve 1.31, 1.00, and 0.83

bits per pixel, respectively. Note that the images look basically the same; however,

the multitree algorithm gives compression savings of 37% over JPEG and 17% over

the quadtree algorithm.

Fig. 3.11 illustrates the results for the same image at the bit rate 0.49 bits per

pixel. (In this experiment, the bit rate was fixed at 0.49, and the distortions for

the quadtree and multitree methods were minimized.) At this bit rate, the JPEG,

quadtree, and multitree algorithms achieve PSNR’s for the overall image of 28.3 dB,

30.5 dB, and 31.9 dB, respectively. A patch from the image and its three compressed

versions is shown in Fig. 3.11. In addition to a higher signal-to-noise ratio, it is clear

from the figure that the multitree algorithm results in less blocky renditions of

homogeneous areas of the image, sharper edges, and less ringing and blockiness in

the textured areas and around the edges.

In these experiments, our implementation of JPEG is a baseline implementation

which uses Huffman coding of the coefficients. To make the comparisons fair, we use

similar Huffman coding strategies for the quadtree and multitree algorithms.


(a) Original image (b) JPEG, 1.31 bpp

(c) Quadtree compression, 1.00 bpp (d) Multitree compression, 0.83 bpp

Fig. 3.10. Results for the “barbara” image at PSNR = 36.4 dB: (a) original image, (b) JPEG (rate = 1.31 bits per pixel), (c) quadtree compression (rate = 1.00 bits per pixel), and (d) multitree compression (rate = 0.83 bits per pixel).

Further experiments show that, if we replace Huffman coding with arithmetic

coding, then our multitree coder becomes competitive when compared to the state-of-

the-art embedded wavelet coders such as JPEG2000 [71] and SPIHT [72] which both

employ arithmetic coding. Fig. 3.12 shows the rate-distortion curves for JPEG2000,


(a) A patch of “barbara” (b) JPEG

(c) Quadtree (d) Multitree

Fig. 3.11. Results for the “barbara” image at the bit rate of 0.49 bits per pixel: (a) a patch of the original image, (b) JPEG (PSNR for the overall image = 28.3 dB), (c) quadtree compression (PSNR = 30.5 dB), and (d) multitree compression (PSNR = 31.9 dB).

SPIHT, and our multitree coder with arithmetic coding. The right column of the

figure displays the bit rates as percentages of the multitree bit rate. For “goldhill”

(top row) and “cameraman” (bottom row), our algorithm clearly outperforms both

JPEG2000 and SPIHT. It also does better than SPIHT for “barbara” (second row)

and better than JPEG2000 for “lenna” (third row).



Fig. 3.12. Rate-Distortion curves for “goldhill” (top row), “barbara” (second row), “lenna” (third row), and “cameraman” (bottom row). The right column shows bit rates as percentages of the bit rate for the multitree algorithm with arithmetic coding of the coefficients.


3.6 Multitree Dictionaries

We now generalize our algorithms of Sections 3.2, 3.3, and 3.4.1 and show that

they are all instances of one general algorithm which is applicable to a wide variety

of scenarios.

Tree models such as those of Sections 3.2, 3.3, and 3.4.1 are conveniently described

using the formalism of grammars. We define a grammar G = (A, S) to be a pair of

the following sets:

• a set A of symbols,4 and

• a set S of allowed splits, also called productions, of the form a → α where

a ∈ A, and α is a finite sequence of elements of A.

For example, in Section 3.4.1, the symbols are pairs (P, x) where P is a rectangu-

lar region and x ∈ {1, . . . , X}, and the productions are all of the form (P, x) →(P ′, x′) (P ′′, x′′) where P ′ and P ′′ are two rectangles which partition P .

By starting with a single element of A, we can generate various sequences of

elements of A via recursive splitting—i.e., recursive application of productions. This

process can be visualized as a tree where each production a → α is depicted as a

node labeled a whose children are labeled with the elements of α, left to right. We

let T (G) be the set of all trees that can be produced5 by the grammar G.

Note that in the previous sections, the splitting process was binary and led to

binary trees. Here, we allow splits into an arbitrary finite number of symbols.

We let a multitree dictionary Ta(G) be the set of all trees in T (G) whose root is

labeled a. We say that a grammar G = (A, S) is finite-depth if, for every a ∈ A,

4This is somewhat different from standard treatments of grammars [73] which distinguish between the start symbol

which can only appear at the root, the nonterminal symbols which can only appear at the nonroot internal nodes,

and terminal symbols which can only appear at the leaves. We, on the other hand, assume that any symbol in A

can appear at the root or any internal nodes or leaf nodes.

5We assume that each branch of our recursive tree generation process can stop after any number of recursions. This

is different from standard treatments of grammars [73] where the stopping is handled via distinguishing between

nonterminal symbols which must have children, and terminal symbols which never have children.


Ta(G) is a finite set. This can be ensured by only allowing a finite set of symbols to

be descendants of a, and not allowing a to be its own descendant.

Suppose that each symbol u ∈ A is assigned a cost c(u), and that each production

u→ α ∈ S is assigned a cost c(u→ α). Suppose further that the cost cost(t) of any

tree t ∈ Ta(G) is the sum of the individual costs of all the productions comprising t,

plus the sum of the costs of all its leaves:

cost(t) = ∑_{u→α∈t} c(u → α) + ∑_{u∈yield(t)} c(u).    (3.12)

We would like to find the best tree in the dictionary Ta(G)—i.e., the tree t∗a whose cost is the smallest:

t∗a = arg min_{t∈Ta(G)} cost(t).

We denote the corresponding cost by C∗a , i.e., C∗a = C(t∗a). We let Sa be the set of

all allowed splits of a fixed symbol a. To illustrate our fast recursive algorithm for

best tree search, we first suppose that Sa = {a → b1 b2}. Then there is a single

tree in Ta(G) which consists of one node labeled a with cost([a]) = c(a). For any

other tree t ∈ Ta(G), its left subtree tleft is in Tb1(G), and its right subtree tright is

in Tb2(G). Therefore, since the cost is additive,

cost(t) = c(a→ b1 b2) + cost(tleft) + cost(tright).

Consequently, the optimal tree is: t∗a is the tree with root a and subtrees t∗b1 and t∗b2 if c(a → b1 b2) + C∗b1 + C∗b2 < c(a), and t∗a = [a] otherwise.

In other words, we find the best trees t∗b1 and t∗b2 in the dictionaries Tb1(G) and

Tb2(G), respectively, and compare their total cost plus the cost of the root production

a→ b1 b2, with the cost of the tree [a].

We have a similar recursion in the general case. We let R(a) be the set of the

right-hand sides of all the elements of Sa. Then the possible candidates for t∗a are

• the tree with root a and subtrees t∗b1, . . . , t∗b|α|, with cost c(a → α) + ∑_{i=1}^{|α|} C∗bi, for any α = (b1 b2 . . . b|α|) ∈ R(a), and

• [a], with cost c(a).

To find the globally optimal t∗a, we recursively search over these possibilities. The

recursion terminates when Sa = ∅: in this case, t∗a = [a]. The termination is

guaranteed to happen in a finite number of steps for a finite-depth grammar. To avoid

repetitive calculation, we store the optimal costs and corresponding productions in a

global data structure called Table, as illustrated in the pseudocode of Fig. 3.13(a).

Once this recursive call is done, the best tree can be generated from Table using

the pseudocode in Fig. 3.13(b).
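A compact Python rendering of this general search is given below; the grammar is supplied through three user callbacks, and the names, the tree representation, and the use of memoization in place of Table are illustrative assumptions, not a prescription of the thesis implementation.

from functools import lru_cache

def best_multitree(a, leaf_cost, split_cost, splits):
    """Best tree in the multitree dictionary rooted at symbol `a` (a sketch of Fig. 3.13).

    leaf_cost(u)          -> c(u), the cost of keeping symbol u as a leaf;
    split_cost(u, alpha)  -> c(u -> alpha), the cost of the production u -> alpha;
    splits(u)             -> R(u), the allowed right-hand sides (tuples of symbols).
    Symbols must be hashable and the grammar finite-depth.
    Returns (optimal cost, optimal tree), a tree being a pair (symbol, list of subtrees)."""

    @lru_cache(maxsize=None)                   # plays the role of the global Table
    def best_split(u):
        best_cost, best_alpha = leaf_cost(u), None
        for alpha in splits(u):
            cost = split_cost(u, alpha) + sum(best_split(b)[0] for b in alpha)
            if cost < best_cost:
                best_cost, best_alpha = cost, alpha
        return best_cost, best_alpha

    def build(u):
        _, alpha = best_split(u)
        if alpha is None:
            return (u, [])                     # the single-node tree [u]
        return (u, [build(b) for b in alpha])

    return best_split(a)[0], build(a)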

The most significant computational burden is in computing and storing the best

costs and productions. To analyze this procedure, we let A(a) be the union of {a} and the set of all symbols which can be descendants of a. We let SA(a) be the set of

all allowed splits of elements of A(a). For each symbol b ∈ A(a), there is exactly one

recursive call to the subroutine best split of Fig. 3.13(a). During this call, the costs

of all possible splits of b are compared. The number of such comparisons is |Sb|. Therefore, the overall time complexity of the algorithm is O(|SA(a)|). In applications

where only the yield of a tree is of interest, such as our rectangular tiling example of

Section 3.2, there is some redundancy associated with searching over multiple trees

which have the same yield. In some instances, such as in [64,65], this redundancy is

very significant and may be eliminated, leading to a lower time complexity.

The overall space complexity is O(|A(a)|) since we need to store two numbers—

the best cost and the best split—for each symbol in A(a). The key to controlling the

time and space complexity is therefore keeping the sizes of the sets SA(a) and A(a)

low. In addition, as we have remarked before, the computation of the costs c(a→ α)

and c(a) could actually dominate the time complexity of the overall algorithm, and

therefore another important guideline to a successful application of our algorithm is

to use tractable cost functions.


(C∗a, s∗a) ← best split(a) {
    if C∗a has been computed
        get C∗a and s∗a from the global data structure Table;
    else {
        s∗a ← ∅;                  // Initialize s∗a
        C∗a ← c(a);               // Initialize C∗a
        for α ∈ R(a) {
            for b ∈ α
                (C∗b, s∗b) ← best split(b);
            if c(a → α) + ∑_{b∈α} C∗b < C∗a {
                s∗a ← α;
                C∗a ← c(a → α) + ∑_{b∈α} C∗b;
            }
        }
        record C∗a and s∗a in the global data structure Table;
    }
    return C∗a and s∗a;
}

(a) Recursive calculation of best splits and costs.

t∗a ← best tree(a) {
    get s∗a from the global data structure Table;
    if s∗a is the empty set
        t∗a ← [a];
    else {
        i ← 0;
        for b ∈ s∗a {
            i ← i + 1;
            bi ← b;
            t∗bi ← best tree(bi);
        }
        t∗a ← the tree with root a and subtrees t∗b1, . . . , t∗bi;
    }
    return t∗a;
}

(b) Recursive generation of best tree.

Fig. 3.13. Pseudocode for the recursive calculation of the best splits and best costs, and for the recursive generation of the globally optimal tree.


We note that the dynamic programming algorithm of Section 3.4.2 is easily gen-

eralized to the problem of finding the optimal tree in each of a sequence of multitree

dictionaries, provided that the overall cost has additive structure, as in Eq. (3.10).

3.7 Relationships with Prior Work

It can be easily shown that standard wavelet packet and dyadic local cosine

dictionaries [2, 3], as well as anisotropic 2-D wavelet packet dictionaries [30, 40], are

all multitree dictionaries. It is also easy to see that a specialization of our algorithm

of Fig. 3.13 to the wavelet packets and dyadic local cosines is essentially a restatement

of the best basis algorithm of [2,3], its specialization to anisotropic wavelet and cosine

packets is a restatement as the anisotropic best basis algorithm of [30, 40], and its

specialization to dyadic tiling is a restatement of the dyadic CART algorithm of [40].

Our algorithm can also be used for a variety of other dictionaries, such as, for

example, any dictionary of block or lapped bases in two or more dimensions. It is

interesting to point out that arbitrary block and lapped dictionaries in 1-D can be

efficiently searched without exploiting their tree structure, but rather using standard

dynamic programming techniques, as was shown in [64,65].

It was pointed out in [40] that there is a close relationship between the best

basis algorithm of [2,3] and pruning methods used in the design of classification and

regression trees [74]. These methods have also been used for vector quantization and

other applications [75]. These and other methods such as, for example, [34, 43, 76–

84], seek to optimally tile a multidimensional domain with dyadic hyperrectangles.

Our multitree algorithm can be applied to these problems, allowing one to lift the

requirement that the split locations be dyadic, and to optimally tile a domain with

arbitrary hyperrectangles.

We now point out a close relationship between our algorithm and procedures

for estimating the maximum a posteriori probability parse of a string [73, 85, 86] or

an image [87–90]. In these problems, −c(u → α) of Eq. (3.12) stands for the log-


probability of the production u → α, and the probability of a tree t is defined as

the product of the probabilities of all the productions in t. The objective of these

estimation tasks is to find the most probable tree, i.e., to minimize with respect

to t the negative log-probability of the tree t, ∑_{u→α∈t} c(u → α). But this is exactly what our algorithm of Fig. 3.13 does. Thus, the estimation algorithms of [73, 85–

90] represent special cases of our search algorithm for the best tree in a multitree

dictionary.

3.8 Conclusions

We presented a general framework of multitree dictionaries and provided a re-

cursive algorithm for finding the best representation in a multitree dictionary. We

illustrated our framework and algorithm within the contexts of optimal rectangular

and wedgelet tilings and image compression, and designed a new block image coder.

The key property that enables our algorithm to be fast for any additive or multi-

plicative cost is the fact that, while the number of possible trees can be enormous,

the number of possible symbols at tree nodes is typically manageable. By storing

the optimal cost and the optimal set of children for each symbol in a global data

structure, the algorithm only needs to make one recursive call per symbol.

In the future we plan to further explore the flexibility of our framework and design

various other multitree dictionaries which allow a fast selection of the best represen-

tation in applications such as time-frequency analysis, approximation, embedded

image compression, video compression, vector quantization, and classification.

LIST OF REFERENCES

[1] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noiseremoval algorithms. Physica D, 60:259–268, 1992.

[2] R. R. Coifman and M. V. Wickerhauser. Entropy based algorithms for bestbasis selection. IEEE Trans. Inf. Th, 38(2):713–718, March 1992.

[3] R. R. Coifman, Y. Meyer, and M. V. Wickerhauser. Wavelet analysis and signalprocessing. In M. B. Ruskai et al., editor, Wavelets and Their Applications,pages 153–178. Jones and Bartlett, Boston, 1992.

[4] I. Cohen, S. Raz, and D. Malah. Orthonormal shift-invariant adaptive localtrigonometric decomposition. Sig. Proc., 57(1):43–64, February 1997.

[5] I. Pollak. Segmentation and noise suppression via nonlinear multiscale filtering.IEEE Signal Processing Magazine, September 2002.

[6] P. C. Teo, G. Sapiro, and B. Wandell. Anisotropic smoothing of posteriorprobabilities. In Proc. ICIP, Santa Barbara, CA, 1997.

[7] S. C. Zhu and D. Mumford. Prior learning and gibbs reaction-diffusion. IEEETrans. on PAMI, 19(11), 1997.

[8] S. Kh. Djumagazieva. Numerical integration of a certain partial differentialequation. U.S.S.R. Comput. Maths. Math. Phys., 23(4):45–49, 1983.

[9] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffu-sion. IEEE Trans. on PAMI, 12(7), 1990.

[10] S. C. Zhu and A. Yuille. Region competition: unifying snakes, region growing,and Bayes/MDL for multiband image segmentation. IEEE Trans. on PAMI,18(9), 1996.

[11] M. J. Black, G. Sapiro, D. H. Marimont, and D. Heeger. Robust anisotropicdiffusion. IEEE Trans. on Image Processing, 7(3):421–432, 1998.

[12] H. Krim and Y. Bao. A stochastic diffusion approach to signal denoising. InProc. ICASSP, Phoenix, AZ, 1999.

[13] I. Pollak, A. S. Willsky, and H. Krim. A nonlinear diffusion equation as a fastand optimal solver of edge detection problems. In Proc. ICASSP, Phoenix, AZ,1999.

[14] I. Pollak. Nonlinear Scale-Space Analysis in Image Processing. PhD thesis,Laboratory for Information and Decision Systems, MIT, 1999. LIDS-TH-2461.


[15] C. Bouman and K. Sauer. An edge-preserving method for image reconstructionfrom integral projections. In Proc. Conf. on Info. Sci. and Syst., pages 383–387,Johns Hopkins University, Baltimore, MD, March 1991.

[16] K. Sauer and C. Bouman. Bayesian estimation of transmission tomograms usingsegmentation based optimization. IEEE Trans. on Nuclear Science, 39(4):1144–1152, August 1992.

[17] A. Chambolle and P. L. Lions. Image recovery via total variation minimizationand related problems. Numer. Math., 76:167–188, 1997.

[18] R. Acar and C. R. Vogel. Analysis of bounded variation penalty methods forill-posed problems. Inverse Problems, 10(6):1217–1229, 1994.

[19] S. Alliney. A property of the minimum vectors of a regularizing functionaldefined by means of the absolute norm. IEEE Trans. on Signal Processing,45(4):913–917, April 1997.

[20] S. Alliney and S. A. Ruzinsky. An algorithm for the minimization of mixed `1

and `2 norms with applications to bayesian estimation. IEEE Trans. on SignalProcessing, 42(3):618–627, March 1994.

[21] T. F. Chan, G. H. Golub, and P. Mulet. A nonlinear primal-dual method for TV-based image restoration. In Proc. ICAOS: Images, Wavelets, and PDEs, pages 241–252, Paris, France, June 1996.

[22] C. R. Vogel and M. E. Oman. Fast, robust total variation-based reconstructionof noisy, blurred images. IEEE Trans. on Image Processing, 7(6), June 1998.

[23] P. Blomgren and T. F. Chan. Modular solvers for constrained image restorationproblems. Numerical Linear Algebra with Applications, 9(5):347–358, 2002.

[24] D. Dobson and F. Santosa. An image enhancement technique for electricalimpedance tomography. Inverse Problems, 10:317–334, 1994.

[25] A. Marquina and S. Osher. Explicit algorithms for a new time dependent modelbased on level set motion for nonlinear deblurring and noise removal. SIAMJournal on Scientific Computing, 22(2):387–405, 2000.

[26] I. Pollak, A. S. Willsky, and H. Krim. Image segmentation and edge enhance-ment with stabilized inverse diffusion equations. IEEE Trans. on Image Pro-cessing, 9(2), February 2000.

[27] M. G. Fleming, C. Steger, J. Zhang, J. Gao, A. B. Cognetta, I. Pollak, andC. R. Dyer. Techniques for a structural analysis of dermatoscopic imagery.Computerized Medical Imaging and Graphics, 22, 1998.

[28] H. van Trees. Detection, Estimation, and Modulation Theory, volume 1. Wiley,1968.

[29] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms.MIT Press, 1990.

[30] N. N. Bennett. Fast algorithm for best anisotropic walsh bases and relatives.J. of Appl. and Comput. Harmonic Analysis, 8:86–103, 2000.


[31] K. Ramchandran and M. Vetterli. Best wavelet packet bases in a rate-distortionsense. IEEE Trans. Im. Proc., 2(2):160–175, Apr. 1993.

[32] M. Lindberg. Two-Dimensional Adaptive Haar-Walsh Tilings. Licentiat Thesisin Applied Mathematics, Abo Akademi University, Abo, Finland, October 1999.

[33] M. Lindberg and L. F. Villemoes. Image compression with adaptive Haar-Walsh tilings. In Wavelet Applications in Signal and Image Processing VIII,Proc. SPIE 4119, 2000.

[34] M. B. Wakin, J. K. Romberg, H. Choi, and R. G. Baraniuk. Rate-distortionoptimized image compression using wedgelets. In Proceedings of ICIP-2002,Rochester, New York, September 2002.

[35] H. Krim and J.-C. Pesquet. On the statistics of best bases criteria. In A. An-toniadis, editor, Wavelets and statistics, Lecture Notes in Statistics, pages 193–207. Springer-Verlag, 1995.

[36] J.-C. Pesquet, H. Krim, D. Leporini, and E. Hamman. Bayesian approach tobest basis selection. In Proc. ICASSP-96, pages 2634–2638, Atlanta, USA, May1996.

[37] H. Krim, D. Tucker, S. Mallat, and D. Donoho. On denoising and best signalrepresentation. IEEE Trans. Inf. Th., 45(7):2225–2238, Nov. 1999.

[38] S. G. Mallat. A Wavelet Tour of Signal Processing. Academic Press, secondedition, 1999.

[39] P. Moulin. Signal estimation using adapted tree-structured bases and the MDLprinciple. In Proc. IEEE-SP Int. Symp. TFTS, pages 141–143, Paris, June 1996.

[40] D. L. Donoho. CART and best-ortho-basis: A connection. Ann. Stat., 25:1870–1911, 1997.

[41] D. L. Donoho and I. M. Johnstone. Ideal denoising in an orthonormal basischosen from a library of bases. Comptes Rendus Acad. Sci., Ser. I 319:1317–1322, 1994.

[42] D. L. Donoho. Wedgelets: Nearly minimax estimation of edges. Ann. Statist.,27:859–897, 1999.

[43] R. M. Willett and R. D. Nowak. Platelets: a multiscale approach for recoveringedges and surfaces in photon-limited medical imaging. IEEE Trans. MedicalImaging, 22(3):332–350, March 2003.

[44] C. Herley, J. Kovacevic, K. Ramchandran, and M. Vetterli. Tilings of thetime-frequency plane: construction of arbitrary orthogonal bases and fast tilingalgorithms. IEEE Trans. Sig. Proc., 41(12):3341–3359, December 1993.

[45] C. Herley, Z. Xiong, K. Ramchandran, and M. T. Orchard. Joint space-frequency segmentation using balanced wavelet packet tree for least-cost imagerepresentation. IEEE Trans. Im. Proc., 6(9):1213–1230, September 1997.

[46] L. F. Villemoes. Adapted bases of time-frequency local cosines. Preprint, June1999, www.math.kth.se/old-home-pages/larsv/publ.html.


[47] C. M. Thiele and L. F. Villemoes. A fast algorithm for adapted time-frequencytilings. J. of Appl. and Comput. Harmonic Analysis, 3:91–99, 1996.

[48] Z. Xiong, K. Ramchandran, C. Herley, and M. T. Orchard. Flexible tree-structured signal expansions using time-varying wavelet packets. IEEE Trans.Sig. Proc., 45(2):333–345, February 1997.

[49] J. H. Rothweiler. Polyphase quadrature filters—a new subband coding tech-nique. In Proc. ICASSP-83, pages 1280–1283, Boston, MA, March 1983.

[50] R. R. Coifman and Y. Meyer. Remarques sur l’analyse de fourier a fenetre.C.R. Acad. Sci., pages 259–261, 1991.

[51] H. Malvar. Lapped transforms for efficient transform/subband coding. IEEETrans. ASSP, 38(6):969–978, June 1990.

[52] H. Malvar. Signal Processing with Lapped Transforms. Artech House, 1992.

[53] R. Bellman. On the approximation of curves by line segments using dynamicprogramming. Comm. ACM, 4(6):284, 1961.

[54] J.-C. Perez and E. Vidal. Optimum polygonal approximation of digitized curves.Pattern Recognition Letters, 15:743–750, August 1994.

[55] G. Papakonstantinou. Optimal polygonal approximation of digital curves. SignalProcessing, 8:131–135, 1985.

[56] O. A. Niamut and R. Heusdens. RD optimal time segmentations for the time-varying MDCT. To appear in Proceedings of European Signal Processing Con-ference (Eusipco), Vienna, Austria, September 6-10 2004.

[57] P. Prandoni and M. Vetterli. R/D optimal linear prediction. IEEE Trans. Speechand Audio Proc., 8(6):646–655, November 2000.

[58] O. A. Niamut and R. Heusdens. Flexible frequency decompositions for cosine-modulated filter banks. In Proc. ICASSP-2003, Hong Kong, April 2003.

[59] Y. Meyer. Principe d’incertitude, bases hilbertiennes et algebres d’operateurs.In Seminaire Bourbaki, volume 662, Paris, 1986.

[60] E. D. Kolaczyk. Wavelet Methods for the Inversion of Certain HomogeneousLinear Operators in the Presence of Noisy Data. PhD thesis, Department ofStatistics, Stanford University, October 1994.

[61] D. Donoho, M. R. Duncan, X. Huo, O. Levi, J. Buckheit, M. Clerc, J. Kalifa,S. G. Mallat, and T. Yu. Wavelab 802. www-stat.stanford.edu/~wavelab.

[62] P. M. Cassereau, D. H. Staelin, and G. De Jager. Encoding of images based ona lapped orthogonal transform. IEEE Trans. Comm., 37(2):189–193, February1989.

[63] P. M. Cassereau. A new class of optimal unitary transforms for image processing.Master’s thesis, EECS, MIT, May 1985.


[64] Y. Huang, I. Pollak, C. A. Bouman, and M. N. Do. New algorithms for bestlocal cosine basis search. In Proc. ICASSP-2004, Montreal, Quebec, May 17-212004. www.ece.purdue.edu/~ipollak/icassp04.pdf.

[65] Y. Huang, I. Pollak, C. A. Bouman, and M. N. Do. Best basis search in lappeddictionaries. Submitted to IEEE Trans. Sig. Proc.

[66] D. Xu and M. N. Do. Anisotropic 2-D wavelet packets and rectangular tiling:theory and algorithms. In Proc. SPIE Conf. on Wavelet Appl. in Sig. and Im.Proc. X, San Diego, Aug. 2003.

[67] U. Ndili. A coding theoretic approach to image segmentation. Master’s thesis,Rice University, Houston, Texas, April 2001.

[68] U. Ndili, R. D. Nowak, and M.A.T. Figueiredo. Coding theoretic approach toimage segmentation. In Proc. ICIP-2001.

[69] H. Cheng and C. A. Bouman. Document compression using rate-distortionoptimized segmentation. Journal of Electronic Imaging, 10(2):460–474, April2001.

[70] G. M. Schuster and A. K. Katsaggelos. A video compression scheme with op-timal bit allocation between displacement vector field and displaced frame dif-ference. In Proc. ICASSP-96, pages 1967–1970, Atlanta, GA, May 1996.

[71] D. Taubman. High performance scalable image compression with EBCOT.IEEE Trans. Im. Proc., 9(7):1158–1170, July 2000.

[72] A. Said and W. A. Pearlman. A new, fast, and efficient image codec basedon set partitioning in hierarchical trees. IEEE Trans. Circ. Syst. Vid. Tech.,6(3):243–250.

[73] C. Manning and H. Schutze. Foundations of Statistical Natural Language Pro-cessing. MIT Press, 1999.

[74] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification andRegression Trees. Chapman & Hall, New York, 1984.

[75] P. A. Chou, T. Lookabaugh, and R. M. Gray. Optimal pruning with applicationsto tree-structured source coding and modeling. IEEE Trans. Inf. Th., 35(2):299–315, March 1989.

[76] G. Blanchard, C. Schafer, and Y. Rozenholc. Oracle bounds and exact algorithmfor dyadic classification trees. In Proc. 17th. Conf. on Learning Theory. SpringerLecture Notes in Artificial Intelligence, volume 3120, pages 378–392, 2004.

[77] J. Vaisey and A. Gersho. Image compression with variable block size segmen-tation. IEEE Trans. Sig. Proc., 40(8):2040–2060, August 1992.

[78] G. J. Sullivan and R. L. Baker. Efficient quadtree coding of images and video.IEEE Trans. Im. Proc., 3(3):327–331, May 1994.

[79] R. Leonardi and M. Kunt. Adaptive split-and-merge for image analysis andcoding. In Proc. SPIE, volume 594, pages 2–9, 1985.


[80] R. M. Figueras i Ventura, L. Granai, and P. Vendergheynst. R-D analysis ofadaptive edge representations. In Proc. IEEE Workshop MMSP, pages 130–133,December 2002.

[81] R. Shukla, P. L. Dragotti, M. N. Do, and M. Vetterli. Rate-distortion optimizedtree structured compression algorithms for piecewise smooth images. IEEETrans. Im. Proc., to appear.

[82] C. Scott and R. D. Nowak. Minimax-optimal classification with dyadic decisiontrees. Preprint, 2004, www.stat.rice.edu/~cscott/pubs.html.

[83] E. Le Pennec and S. G. Mallat. Sparse geometric im-age representations with bandelets. Preprint, 2003,www.cmap.polytechnique.fr/~mallat/biblio.html.

[84] M. Wien. Variable block-size transforms for H.264/AVC. IEEETrans. Ckts. Syst. Vid. Tech., 13(7):604–613.

[85] J. Baker. Trainable grammars for speech recognition. In D. Klatt and J. Wolf,editors, Speech Communications Papers for the 97th Meeting of the AcousticalSociety of America, pages 547–550, 1979.

[86] K. Lari and S. Young. The estimation of stochastic context-free grammars usingthe inside-outside algorithm. Computer Speech and Language, 4:35–56, 1990.

[87] I. Pollak, J. M. Siskind, M. P. Harper, and C. A. Bouman. Modeling and esti-mation of spatial random trees with application to image classification. In Proc.ICASSP, Hong Kong, 2003. www.ece.purdue.edu/~ipollak/icassp03.pdf.

[88] I. Pollak, J. M. Siskind, M. P. Harper, and C. A. Bouman. Parameter estimationfor spatial random trees using the EM algorithm. In Proc. ICIP, Barcelona,2003. www.ece.purdue.edu/~ipollak/icip03.pdf.

[89] I. Pollak, J. M. Siskind, M. P. Harper, and C. A. Bouman. Spa-tial random trees and the center-surround algorithm. Technical Re-port TR-ECE-03-03, Purdue University, School of ECE, January 2003.www.ece.purdue.edu/~ipollak/it03.pdf.

[90] J. M. Siskind, J. Sherman, I. Pollak, M. P. Harper, and C. A. Bouman. Spatialrandom tree grammars for modeling hierarchal structure in images. Preprint,May 2004. www.ece.purdue.edu/~ipollak/draft2004 5 25.pdf.

APPENDICES

Appendix A: Proof of Proposition 4

The proof of the central result of Chapter 1–Proposition 5 of Section 1.5–uses Eq.

(1.18) which follows from Proposition 4 of Section 1.3. We prove this proposition

below. The proof is organized in several modules: the main body of the proof, and

auxiliary lemmas which prove one equality and two inequalities used in the main

body.

A.1 Proof of Four Auxiliary Lemmas

In the first lemma, we consider the solution u(t) of Eq. (1.2,1.3) at some fixed time

instant t = τ. For the case when u(τ) has an edge at (n, n+1) (i.e., un(τ) ≠ un+1(τ)), we calculate the rate of change of the cumulative sum ∑_{k=1}^{n} uk(t) during the time interval 0 ≤ t ≤ τ.

Lemma 1 Let u(t) be the solution of Eq. (1.2,1.3), with final hitting time tf , and

let τ be a fixed time instant with 0 ≤ τ < tf . Suppose that index n is such that

un(τ) < un+1(τ). (A.1)

Then

d/dt ∑_{k=1}^{n} uk(t) = 1   for any t ∈ [0, τ].    (A.2)

Similarly,

if un(τ) > un+1(τ), then d/dt ∑_{k=1}^{n} uk(t) = −1 for any t ∈ [0, τ].    (A.3)

Proof. Suppose (A.1) holds. Then necessarily

un(t) < un+1(t), for any t ∈ [0, τ ]. (A.4)

Indeed, suppose this was not true–i.e., there was a time instant t ∈ [0, τ ] for which

un(t) ≥ un+1(t). Then, owing to the continuity of the solution, there would be a

time instant T ∈ [t, τ] such that un(T) = un+1(T). At that time instant, the samples

n and n + 1 would be merged, and their intensities would stay equal for all future

time, resulting in un(τ) = un+1(τ) and contradicting (A.1).

It follows from (A.4) that samples n and n+1 belong to different regions of u(t),

for any t ∈ [0, τ ], which means that n + 1 is always the left endpoint of a region.

Suppose that at some time instant t, the point n + 1 is the left endpoint of region j:

n + 1 = nj(u(t)).

It follows from Eq. (1.2) that the time derivatives of all samples of the i-th region of u(t) are the same and equal to µ̇i(u(t)), and therefore

d/dt ∑_{k=1}^{n} uk(t) = ∑_{k=1}^{n} u̇k(t) = ∑_{i=1}^{j−1} mi(u(t)) µ̇i(u(t)) (Eq. (1.2))= ∑_{i=1}^{j−1} {sgn[µi+1(u(t)) − µi(u(t))] − sgn[µi(u(t)) − µi−1(u(t))]}

= sgn[µj(u(t)) − µj−1(u(t))] = sgn[un+1(t) − un(t)] = 1.

It is similarly shown that Eq. (A.3) holds.

In the next lemma, we consider a region of the solution u(t) to Eq. (1.2,1.3) at a

particular time instant τ , and use Lemma 1 to calculate the rate of change for the

sum of intensities within the region for t ∈ [0, τ ].

Lemma 2 Let u(t) be the solution of Eq. (1.2,1.3), with final hitting time tf , and

let τ be a fixed time instant with 0 ≤ τ < tf . Then

d/dt ∑_{k=ni(u(τ))}^{ni+1(u(τ))−1} uk(t) = −βi(u(τ)) ρi(u(τ)),   for any t ∈ [0, τ].    (A.5)

In other words, for any region of u(τ), the sum of values within the region evolves

with a constant velocity (given by the right-hand side of Eq. (A.5)) from t = 0 until

t = τ .


Proof. Representing the left-hand side of Eq. (A.5) as

d/dt ∑_{k=1}^{k1} uk(t) − d/dt ∑_{k=1}^{k2} uk(t),

with k1 = ni+1(u(τ))−1 and k2 = ni(u(τ))−1, we get Eq. (A.5) as a direct corollary

of Lemma 1.

The next lemma essentially says that a local averaging operation cannot result in

an increase of the total variation. This is natural to expect, since the total variation

is a measure of “roughness”.

Lemma 3 Let x,u ∈ RN be arbitrary signals, and let p = p(u); ni = ni(u) for

i = 1, . . . , p; and np+1 = N + 1. Let x∗ ∈ RN be the result of averaging x over the

regions of u:

x∗_{n_i} = x∗_{n_i+1} = · · · = x∗_{n_{i+1}−1} = (1/(n_{i+1} − n_i)) ∑_{k=n_i}^{n_{i+1}−1} x_k,   for i = 1, . . . , p.

Then TV (x∗) ≤ TV (x).

Proof. We introduce the following notation:

x_{i,max} = max_{n_i ≤ k ≤ n_{i+1}−1} x_k,   for i = 1, . . . , p;

x_{i,min} = min_{n_i ≤ k ≤ n_{i+1}−1} x_k,   for i = 1, . . . , p;

s_i = x_{i+1,max} − x_{i,min}   if x∗_{n_{i+1}} ≥ x∗_{n_i},
s_i = x_{i,max} − x_{i+1,min}   if x∗_{n_{i+1}} < x∗_{n_i},
for i = 1, . . . , p − 1.

Note that

∑_{i=1}^{p−1} s_i ≤ ∑_{n=1}^{N−1} |x_{n+1} − x_n| = TV(x).    (A.6)

Since x∗_{n_i}, x_{i,min}, and x_{i,max} are the mean, min, and max, respectively, of the numbers {x_k}_{k=n_i}^{n_{i+1}−1}, we have x_{i,min} ≤ x∗_{n_i} ≤ x_{i,max}. Therefore, it follows from the first line of the definition of s_i that, if x∗_{n_{i+1}} ≥ x∗_{n_i}, then

|x∗_{n_{i+1}} − x∗_{n_i}| = x∗_{n_{i+1}} − x∗_{n_i} ≤ x_{i+1,max} − x_{i,min} = s_i.

Similarly, from the second line of the definition of s_i we have that, if x∗_{n_{i+1}} < x∗_{n_i}, then

|x∗_{n_{i+1}} − x∗_{n_i}| = x∗_{n_i} − x∗_{n_{i+1}} ≤ x_{i,max} − x_{i+1,min} = s_i.

In both cases, |x∗_{n_{i+1}} − x∗_{n_i}| ≤ s_i. Summing both sides of this inequality from i = 1 to i = p − 1 and using Eq. (A.6), we get:

TV(x∗) ≤ ∑_{i=1}^{p−1} s_i ≤ TV(x).

The next lemma says that if, in formula (1.1), one uses incorrect locations of the

extrema, the result will be less than or equal to the actual total variation.

Lemma 4 Let x1,x2 ∈ RN be two signals whose segmentation parameters are iden-

tical, except for βi’s. In other words, the extrema of the two signals do not necessarily

occur at the same locations, but p(x1) = p(x2), and ni(x1) = ni(x2) for i = 1, . . . , p. Then

TV(x1) ≥ ∑_{i=1}^{p} βi(x2) ρi µi(x1).    (A.7)

Proof. Let $i_1 < i_2 < \ldots < i_q$ be the regions which are the extrema of $x^2$. Without loss of generality, suppose that the leftmost extremum, $i_1$, is a minimum. The right-hand side of Eq. (A.7) can then be rewritten as follows:

$$
\sum_{i=1}^{p} \beta_i(x^2)\,\rho_i\,\mu_i(x^1)
= -\mu_{i_1}(x^1) + 2 \sum_{r=2}^{q-1} (-1)^r \mu_{i_r}(x^1) + (-1)^q \mu_{i_q}(x^1)
$$
$$
= \big(\mu_{i_2}(x^1) - \mu_{i_1}(x^1)\big) + \big(\mu_{i_2}(x^1) - \mu_{i_3}(x^1)\big) + \big(\mu_{i_4}(x^1) - \mu_{i_3}(x^1)\big) + \ldots
$$
$$
= \sum_{r=1}^{q-1} \big|\mu_{i_{r+1}}(x^1) - \mu_{i_r}(x^1)\big|
\;\le\; \sum_{i=1}^{p(x^1)-1} \big|\mu_{i+1}(x^1) - \mu_i(x^1)\big| \;=\; TV(x^1).
$$

Here the second line groups the terms into differences of the means at consecutive extrema; each such difference is nonnegative because the extrema alternate between minima and maxima, which gives the third line. The final inequality follows by applying the triangle inequality to each difference $\mu_{i_{r+1}}(x^1) - \mu_{i_r}(x^1)$.


A.2 Proof of Proposition 4

Note that, by Proposition 1, we can differentiate at all time points except possibly

the hitting times.

$$
\frac{1}{2}\,\frac{d}{dt}\,\|u(t) - x\|^2
= \frac{1}{2}\,\frac{d}{dt} \sum_{n=1}^{N} \big(u_n(t) - x_n\big)^2
= \sum_{n=1}^{N} \big(u_n(t) - x_n\big)\,\dot{u}_n(t)
$$
$$
= \sum_{n=1}^{N} u_n(t)\,\dot{u}_n(t) \;-\; \sum_{n=1}^{N} x_n\,\dot{u}_n(t). \tag{A.8}
$$

Let us now calculate the two terms in (A.8) separately. In this first calculation, all

segmentation parameters are those of u(t).

$$
\sum_{n=1}^{N} u_n(t)\,\dot{u}_n(t)
= \sum_{i=1}^{p} \sum_{k=n_i}^{n_{i+1}-1} u_k(t)\,\dot{u}_k(t)
\;\overset{\text{Eq. (1.2)}}{=}\; \sum_{i=1}^{p} m_i\,\mu_i\,\dot{\mu}_i
\;\overset{\text{Eq. (1.9)}}{=}\; -\sum_{i=1}^{p} \beta_i\,\rho_i\,\mu_i
\;\overset{\text{Eq. (1.1)}}{=}\; -TV(u(t)). \tag{A.9}
$$

When $x = u(t_\nu)$ with $t_\nu \ge t$, the second term of (A.8) is evaluated as follows (where now the segmentation parameters are those of $u(t_\nu)$):

$$
-\sum_{n=1}^{N} x_n\,\dot{u}_n(t)
= -\sum_{n=1}^{N} u_n(t_\nu)\,\dot{u}_n(t)
= -\sum_{i=1}^{p} \sum_{k=n_i}^{n_{i+1}-1} u_k(t_\nu)\,\dot{u}_k(t)
$$
$$
= -\sum_{i=1}^{p} \mu_i \sum_{k=n_i}^{n_{i+1}-1} \dot{u}_k(t)
\;\overset{\text{Lemma 2}}{=}\; -\sum_{i=1}^{p} \mu_i\,(-\beta_i\,\rho_i) \tag{A.10}
$$
$$
\overset{\text{Eq. (1.1)}}{=}\; TV(u(t_\nu)) \;=\; \nu. \tag{A.11}
$$


Using segmentation parameters of u(t) in this third calculation, we have, for a general

x with TV (x) ≤ ν:

$$
-\sum_{n=1}^{N} x_n\,\dot{u}_n(t)
= -\sum_{i=1}^{p} \sum_{k=n_i}^{n_{i+1}-1} x_k\,\dot{u}_k(t)
\;\overset{\text{Eq. (1.9)}}{=}\; -\sum_{i=1}^{p} \left(\sum_{k=n_i}^{n_{i+1}-1} x_k\right) \left(\frac{-\beta_i\,\rho_i}{m_i}\right)
$$
$$
= \sum_{i=1}^{p} \big(m_i\,x^*_{n_i}\big) \left(\frac{\beta_i\,\rho_i}{m_i}\right)
= \sum_{i=1}^{p} \beta_i\,\rho_i\,x^*_{n_i}
\;\overset{\text{Lemma 4}}{\le}\; TV(x^*)
\;\overset{\text{Lemma 3}}{\le}\; TV(x) \;\le\; \nu, \tag{A.12}
$$

where $x^*$ is as defined in Lemma 3, i.e., obtained by averaging $x$ over the regions of $u(t)$. Substituting (A.9,A.11,A.12) into Eq. (A.8), we obtain the equations (1.13,1.14) that we needed to verify.
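For concreteness, the substitution works out as follows (this merely restates what (A.8), (A.9), (A.11), and (A.12) establish; Eqs. (1.13) and (1.14) themselves are not reproduced in this appendix, so the correspondence is stated only informally):

$$
\frac{1}{2}\,\frac{d}{dt}\,\|u(t) - u(t_\nu)\|^2 \;=\; \nu - TV(u(t)),
\qquad
\frac{1}{2}\,\frac{d}{dt}\,\|u(t) - x\|^2 \;\le\; \nu - TV(u(t)) \quad \text{for any } x \text{ with } TV(x) \le \nu.
$$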

Appendix B: A Suboptimal Strictly Shift-Invariant Algorithm

Recall that M is the size of the finest cell we are considering, and L = N/M

is the total number of such cells. We let s = lM + m where l is the cell where s

appears, 0 ≤ l < L, and m is the position of s within the cell, 0 ≤ m < M . Instead

of finding s∗, i.e., optimizing over m and l jointly, we first optimize over m and then

optimize over l. We optimize over m using the method described in [4]. For each

shift m = 0, 1, . . . , M − 1, define the following basis Bm:

$$
B_m \;=\; \bigcup_{l=0}^{L-1} B_{lM+m,\,(l+1)M+m}.
$$

We calculate the cost Cm of approximating f with the basis Bm:

$$
C_m \;=\; \sum_{l=0}^{L-1} C\big(f,\, B_{lM+m,\,(l+1)M+m}\big).
$$

Then m∗ is found by minimizing the cost Cm over m:

$$
m^* \;=\; \arg\min_{0 \le m \le M-1} C_m.
$$


Using this value of $m^*$, we now choose $l$. Let $O_{lM+m^*,\,N+lM+m^*}$ be the best basis for the signal $f_{lM+m^*,\,N+lM+m^*}$ in the sub-dictionary $D_{lM+m^*}$, for $l = 0, 1, \ldots, L-1$. The best basis for each $l$ is calculated in the same way as in Section 2.2.3. Then we choose the best $l$:

$$
l^* \;=\; \arg\min_{0 \le l \le L-1} C\big(f,\, O_{lM+m^*,\,N+lM+m^*}\big).
$$

The corresponding (suboptimal) shift-invariant best basis $O$ for the signal $f$ is $O_{l^*M+m^*,\,N+l^*M+m^*}$. Finding $m^*$ requires calculating $C_m$ for $m = 0, 1, \ldots, M-1$. Each $C_m$ is the sum of $L$ costs, and calculating each cost requires $O(M \log M)$ operations. So the calculation of $m^*$ requires $O(M \cdot L \cdot M \log M) = O(NM \log M)$ operations. To determine $l^*$, we do basis searches for $L$ signals of length $N$ using the method of Section 2.2.3. We know that the recursion formula (2.6) takes $O(L^2)$ operations for one search, so $L$ searches require $O(L^3)$ operations. The major part of the computational burden, however, is still the calculation of the costs in Eq. (2.6). Fortunately, many of the costs used in the $L$ searches are repeated. We only need to compute the values of $C(f, B_{u+m^*,\,v+m^*})$ with $u = pM$, $v = qM$, where $p = 0, 1, \ldots, L-1$ and $q = p+1, p+2, \ldots, p+L$. It can be shown that this computation has time complexity $O(L^2 N \log N)$, the same as in Section 2.2.3. The two steps taken together, therefore, result in an overall complexity of $O((L^2 + M) N \log N)$, which is similar to the complexity of the basic mod-$M$ algorithm. For comparison, the algorithm of [4] is $O((\log L + M) N \log N)$.

As remarked above, the “best” basis found by this method is suboptimal for the enlarged dictionary. However, in this way we achieved strict shift-invariance essentially without increasing the time complexity of the algorithm.
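To make the two-stage structure concrete, here is a minimal Python sketch (not from the thesis). The functions cell_cost and best_basis_cost are hypothetical stand-ins: the former plays the role of $C(f, B_{a,b})$ and the latter the role of the Section 2.2.3 search over a sub-dictionary $D_s$; only the order in which $m$ and $l$ are optimized reflects the algorithm described above.

```python
import numpy as np

def cell_cost(f, a, b):
    """Hypothetical stand-in for C(f, B_{a,b}): the cost of approximating f
    on the interval [a, b) in that interval's local basis.  The real cost
    would come from a local cosine expansion; any additive, deterministic
    surrogate suffices to illustrate the control flow."""
    seg = np.take(f, np.arange(a, b), mode='wrap')   # periodic extension of f
    return float(np.sum(np.abs(np.diff(seg))))

def best_basis_cost(f, s, M):
    """Hypothetical stand-in for the Section 2.2.3 search applied to the
    length-N signal starting at offset s, i.e., for C(f, O_{s, N+s}).  As a
    toy substitute we compare only two candidates -- the uniform tiling into
    cells of size M and the single interval [s, s+N) -- and keep the cheaper."""
    N = len(f)
    tiled = sum(cell_cost(f, s + l * M, s + (l + 1) * M) for l in range(N // M))
    whole = cell_cost(f, s, s + N)
    return min(tiled, whole)

def suboptimal_shift_invariant_search(f, M):
    """Two-stage search: first choose the cell-grid shift m, then the cell l."""
    N = len(f)
    L = N // M
    # Step 1: cost C_m of the basis B_m (union over l of B_{lM+m, (l+1)M+m}).
    costs_m = [sum(cell_cost(f, l * M + m, (l + 1) * M + m) for l in range(L))
               for m in range(M)]
    m_star = int(np.argmin(costs_m))
    # Step 2: with m fixed at m_star, pick the best starting cell l.
    costs_l = [best_basis_cost(f, l * M + m_star, M) for l in range(L)]
    l_star = int(np.argmin(costs_l))
    return m_star, l_star     # the chosen basis is O_{l*M+m*, N+l*M+m*}

f = np.random.default_rng(1).normal(size=64)
print(suboptimal_shift_invariant_search(f, M=8))
```

The point of the sketch is the control flow: the joint search over $(m, l)$ is replaced by two one-dimensional searches, which is exactly what makes the method suboptimal but cheap.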

We now show how to find the optimal basis which is strictly $M$-shift-invariant. We define the following dictionary:

$$
D_{\mathrm{SIM}} \;=\; \bigcup_{s \text{ is a multiple of } M} D_s,
$$

where the subdictionaries $D_s$ are defined as previously. We can then adapt the algorithm we just described to find the best basis in the dictionary $D_{\mathrm{SIM}}$. Specifically,


we omit the step of finding $m^*$ and set $m^* = 0$. The resulting best basis will be optimal for the dictionary $D_{\mathrm{SIM}}$, and it will be $M$-shift-invariant. The computational complexity for this procedure is $O(L^2 N \log N)$, the same as that of the mod-$M$ algorithm.

Appendix C: Time Complexity of Various Algorithms

In the following table, we summarize the time complexity of the various algo-

rithms introduced in this paper, as well as the SI-LCD algorithm of [4] and the

dyadic tree search of [2, 3]. Here, $N$ is the size of the signal, $M$ is the finest cell size, and $L = N/M$; for the block-based algorithms, $L_2$ is the number of blocks, $M_2 = N/L_2$ is the block size, and $L = M_2/M$ is the number of cells per block.

Algorithm | Time complexity
Dyadic [2, 3] | $O(\log L \cdot N \log N)$
SI-LCD [4] | $O((M + \log L) \cdot N \log N)$
Mod-$M$ (Section 2.2.3) | $O(L^2 \cdot N \log N)$
SI-mod-$M$ (Section 2.2.5) | $O((M + L^2) \cdot N \log N)$
Min-$M$ (Section 2.3.1) | $O(N^2 \cdot N \log N)$
Blocks with blockwise mod-$M$ (Section 2.3.2) | $O(N L^2 \log M_2 + L_2 \cdot N \log N)$
Blocks with blockwise min-$M$ (Section 2.3.2) | $O(N M_2^2 \log M_2 + L_2 \cdot N \log N)$
Overlapping-blocks with blockwise mod-$M$ (Section 2.3.3) | $O(L^2 \cdot N \log N)$
Overlapping-blocks with blockwise min-$M$ (Section 2.3.3) | $O(N^2 \cdot N \log N)$


VITA

Yan Huang received her B.Eng. degree (with highest honors) in electronic engineering from Tsinghua University, Beijing, China, in 2000. Since Fall 2000, she

has been pursuing a direct Ph.D. degree in the School of Electrical and Computer

Engineering at Purdue University, West Lafayette, Indiana. She has been a research

assistant in the Video and Image Systems Engineering (VISE) lab under the super-

vision of Professor Ilya Pollak since Fall 2000. Her research interests include signal

and image compression, noise removal, segmentation, wavelet signal processing, sta-

tistical image modeling, and diffusion equations.

Yan Huang’s publications for her research work at Purdue include:

Journal papers:

• Y. Huang, I. Pollak, M.N. Do, and C.A. Bouman. Fast search for best rep-

resentations in multitree dictionaries. Submitted to IEEE Trans. on Image

Processing

• Y. Huang, I. Pollak, C.A. Bouman, and M.N. Do. Best basis search in lapped

dictionaries. Submitted to IEEE Trans. on Signal Processing

• I. Pollak, A.S. Willsky, and Y. Huang. Nonlinear Evolution Equations as Fast

and Exact Solvers of Estimation Problems. To appear in IEEE Trans. on

Signal Processing, 53(2), February 2005

Conference papers:

• Y. Huang, I. Pollak, and C.A. Bouman. Image Compression with Multitree

Tilings. To appear in Proceedings of ICASSP, March 19-23, 2005, Philadelphia,

PA.


• Y. Huang, I. Pollak, C.A. Bouman, and M.N. Do. New algorithms for best local

cosine basis search. In Proceedings of ICASSP, May 17-21, 2004, Montreal,

Quebec.

• Y. Huang, I. Pollak, C.A. Bouman, and M.N. Do. Time-Frequency Analysis

with Best Local Cosine Bases. In Proc. IS&T/SPIE Computational Imaging II

Conference, January 2004, San Jose, CA.

• Y. Huang, I. Pollak, M.N. Do, and C.A. Bouman. Optimal tilings and best basis

search in large dictionaries. In Proceedings of the 37th Asilomar Conference on

Signals, Systems, and Computers, November 9-12, 2003, Pacific Grove, CA.