
An introduction to sparse stochastic processes

Michael Unser and Pouya Tafti

Copyright © 2012 Michael Unser and Pouya Tafti. These materials are protected by copyright under the Attribution-NonCommercial-NoDerivs 3.0 Unported License from Creative Commons.


Summary

Sparse stochastic processes are continuous-domain processes that admit a parsimonious representation in some matched wavelet-like basis. Such models are relevant for image compression, compressed sensing and, more generally, for the derivation of statistical algorithms for solving ill-posed inverse problems.

This book introduces an extended family of sparse processes that are specified by a generic (non-Gaussian) innovation model or, equivalently, as solutions of linear stochastic differential equations driven by white Lévy noise. It presents the mathematical tools for their characterization. The two leading threads that underlie the exposition are

– the statistical property of infinite divisibility, which induces two distinct types of behavior (Gaussian vs. sparse), to the exclusion of any other;
– the structural link between linear stochastic processes and spline functions, which is exploited to simplify the mathematics.

The last chapter is devoted to the use of these models for the derivation of algorithms that recover sparse signals. This leads to a Bayesian reinterpretation of popular sparsity-promoting processing schemes (such as total-variation denoising, LASSO, and wavelet shrinkage) as MAP estimators for specific types of Lévy processes.

The book, which is mostly self-contained, is targeted at an audience of graduate students and researchers with an interest in signal/image processing, compressed sensing, approximation theory, machine learning, or statistics.


Chapter 1

Introduction

1.1 Sparsity: Occam’s razor of modern signal processing?

The hypotheses of Gaussianity and stationarity play a central role in the standard statistical formulation of signal processing. They fully justify the use of the Fourier transform as the optimal signal representation and naturally lead to the derivation of optimal linear-filtering algorithms for a large variety of statistical estimation tasks. This classical view of signal processing is elegant and reassuring, but it is no longer at the forefront of research.

Starting with the discovery of the wavelet transform in the late 1980s [Dau88, Mal89], researchers in signal processing have progressively moved away from the Fourier transform and have uncovered powerful alternatives. Consequently, they have ceased modeling signals as Gaussian stationary processes and have adopted a more deterministic, approximation-theoretic point of view. The key developments that are presently reshaping the field, and which are central to the theory presented in this monograph, are summarized below.

Novel transforms and dictionaries for the representation of signals: New redundant and non-redundant representations of signals (wavelets, local cosine, curvelets) have emerged during the past two decades and have led to better algorithms for data compression, data processing, and feature extraction. The most prominent example is the wavelet-based JPEG-2000 standard for image compression [CSE00], which outperforms the widely-used JPEG method based on the DCT (discrete cosine transform). Another illustration is wavelet-domain image denoising, which provides a good alternative to more traditional linear filtering [Don95]. The various dictionaries of basis functions that have been proposed so far are tailored to specific types of signals; there does not appear to be one that fits all.

Sparsity as a new paradigm for signal processing: At the origin of this new trend is the key observation that many naturally-occurring signals and images (in particular, the ones that are piecewise-smooth) can be accurately reconstructed from a "sparse" wavelet expansion that involves far fewer terms than the original number of samples [Mal98]. The concept of sparsity has been systematized and extended to other transforms, including redundant representations (a.k.a. frames); it is at the heart of recent developments in signal processing. Sparse signals are easy to compress and to denoise by simple pointwise processing (e.g., shrinkage) in the transformed domain. Sparsity provides an equally powerful framework for dealing with more difficult, ill-posed signal-reconstruction problems [CW08, BDE09]. Promoting sparse solutions in linear models is also of interest in statistics: a popular regression shrinkage estimator is LASSO, which imposes an upper bound on the $\ell_1$-norm of the model coefficients [Tib96].


New sampling strategies with fewer measurements: The theory of compressed sensing deals with the problem of the reconstruction of a signal from a minimal, but suitably chosen, set of measurements [Don06, CW08, BDE09]. The strategy there is as follows: among the multitude of solutions that are consistent with the measurements, one should favor the "sparsest" one. In practice, one replaces the underlying $\ell_0$-norm minimization problem, which is NP-hard, by a convex $\ell_1$-norm minimization which is computationally much more tractable. Remarkably, researchers have shown that this simplification does yield the correct solution under suitable conditions (e.g., the restricted isometry property) [CW08]. Similarly, it has been demonstrated that signals with a finite rate of innovation (the prototypical example being a stream of Dirac impulses with unknown locations and amplitudes) can be recovered from a set of uniform measurements at twice the "innovation rate" [VMB02], rather than twice the bandwidth, as would otherwise be dictated by Shannon's classical sampling theorem.

Superiority of nonlinear signal-reconstruction algorithms: There is increasing empirical evidence that nonlinear variational methods (non-quadratic or sparsity-driven regularization) outperform the classical (linear) algorithms (direct or iterative) that are routinely used for solving bioimaging reconstruction problems [CBFAB97, FN03]. So far, this has been demonstrated for the problem of image deconvolution and for the reconstruction of non-Cartesian MRI [LDP07]. The considerable research effort in this area has also resulted in the development of novel algorithms (ISTA, FISTA) for solving convex optimization problems that were previously considered out of numerical reach [FN03, DDDM04, BT09].

1.2 Sparse stochastic models: The next step beyond Gaussianity

While the recent developments listed above are truly remarkable and have resulted in significant algorithmic advances, the overall picture and understanding is still far from complete. One limiting factor is that the current formulations of compressed sensing and sparse-signal recovery are fundamentally deterministic. By drawing on the analogy with the classical linear theory of signal processing, where there is an equivalence between quadratic energy-minimization techniques and minimum-mean-square-error (MMSE) estimation under the Gaussian hypothesis, there are good chances that further progress is achievable by adopting a complementary statistical-modeling point of view 1. The crucial ingredient that is required to guide such an investigation is a sparse counterpart to the classical family of Gaussian stationary processes (GSP). This monograph focuses on the formulation of such a statistical framework, which may be aptly qualified as the next step after Gaussianity under the functional constraint of linearity.

1. It is instructive to recall the fundamental role of statistical modeling in the development of traditional signal processing. The standard tools of the trade are the Fourier transform, Shannon-type sampling, linear filtering, and quadratic energy-minimization techniques. These methods are widely used in practice: They are powerful, easy to deploy, and mathematically convenient. The important conceptual point is that they are justifiable based on the theory of Gaussian stationary processes (GSP). Specifically, one can invoke the following optimality results:

– The Fourier transform, as well as several of its real-valued variants (e.g., the DCT), is asymptotically equivalent to the Karhunen-Loève transform (KLT) for the whole class of GSP. This supports the use of sinusoidal transforms for data compression, data processing, and feature extraction. The underlying notion of optimality here is energy compaction, which implies decorrelation. Note that decorrelation is equivalent to independence in the Gaussian case only.
– Optimal filters: Given a series of linear measurements of a signal corrupted by noise, one can readily specify its optimal reconstruction (LMMSE estimator) under the general Gaussian hypothesis. The corresponding algorithm (Wiener filter) is linear and entirely determined by the covariance structure of the signal and noise. There is also a direct connection with variational reconstruction techniques, since the Wiener solution can also be formulated as a quadratic energy-minimization problem (Gaussian MAP estimator).
– Optimal sampling/interpolation strategies: While this part of the story is less known, one can also invoke estimation-theoretic arguments to justify Shannon-type, constant-rate sampling, which ensures a minimum loss of information for a large class of predominantly-lowpass GSP [PM62, Uns93]. This is not totally surprising since the basis functions of the KLT are inherently bandlimited. One can also derive minimum mean-square-error interpolators for GSP in general. The optimal signal-reconstruction algorithm takes the form of a hybrid Wiener filter whose input is discrete (signal samples) and whose output is a continuously-defined signal that can be represented in terms of generalized B-spline basis functions [UB05b].

In light of the elements presented in the introduction, the basic requirements for a comprehensive theory of sparse stochastic processes are as follows:

Backward compatibility: There is a large body of literature and methods based on the modeling of signals as realizations of GSP. We would like the corresponding identification, linear-filtering, and reconstruction algorithms to remain applicable, even though they obviously become suboptimal when the Gaussian hypothesis is violated. This calls for an extended formulation that provides the same control of the correlation structure of the signals (second-order moments, Fourier spectrum) as the classical theory does.

Continuous-domain formulation: The proper interpretation of qualifying terms such as "piecewise-smooth", "translation-invariant", "scale-invariant", and "rotation-invariant" calls for continuous-domain models of signals that are compatible with the conventional (finite-dimensional) notion of sparsity. Likewise, if we intend to optimize or possibly redesign the signal-acquisition system, as in generalized sampling and compressed sensing, the very least is to have a model that characterizes the information content prior to sampling.

Predictive power: Among other things, the theory should be able to explain why wavelet representations can outperform the older Fourier-related types of decompositions, including the KLT, which is optimal from the classical perspective of variance concentration.

Ease of use: To have practical usefulness, the framework should allow for the derivation of the (joint) probability distributions of the signal in any transformed domain. This calls for a linear formulation, with the caveat that it needs to accommodate non-Gaussian distributions. In that respect, the best thing beyond Gaussianity is infinite divisibility, which is a general property of random variables that is preserved under arbitrary linear combinations.

Stochastic justification and refinement of current reconstruction algorithms: A convincing argument for adopting a new theory is that it must be compatible with the state of the art, while it also ought to suggest new directions of research. In the present context, it is important to be able to establish the connection with deterministic recovery techniques such as $\ell_1$-norm minimization.

The good news is that the foundations for such a theory exist and can be traced back to the pioneering work of Paul Lévy, who defined a broad family of "additive" stochastic processes, now called Lévy processes. Brownian motion (a.k.a. the Wiener process) is the only Gaussian member of this family and, as we shall demonstrate, the only representative that does not exhibit any degree of sparsity. The theory that is developed in this monograph constitutes the full linear, multidimensional extension of those ideas, where the essence of Paul Lévy's construction is embodied in the definition of Lévy innovations (or white Lévy noise), which can be interpreted as the derivative of a Lévy process in the sense of distributions (a.k.a. generalized functions). The Lévy innovations are then linearly transformed to generate a whole variety of processes whose spectral characteristics are controlled by a linear mixing operator, while their sparsity is governed by the innovations. The latter can also be viewed as the driving term of a corresponding linear stochastic differential equation (SDE). Another way of describing the extent of this generalization is to consider the representation of a general continuous-domain Gaussian


process by a stochastic Wiener integral:

$$s(x) = \int_{\mathbb{R}} h(x, x') \, \mathrm{d}W(x') \qquad (1.1)$$

where $h(x, x')$ is the kernel (that is, the infinite-dimensional analog of the matrix representation of a transformation in $\mathbb{R}^n$) of a general, $L_2$-stable linear operator, and $\mathrm{d}W$ is the so-called random Wiener measure, which is such that

$$W(x) = \int_0^x \mathrm{d}W(x')$$

is the Wiener process. The latter equation constitutes a special case of (1.1) with $h(x, x') = \mathbb{1}_{\{x > x' \ge 0\}}$, where $\mathbb{1}_\Omega$ denotes the indicator function of the set $\Omega$. If $h(x, x') = h(x - x')$ is a convolution kernel, then (1.1) defines the whole class of Gaussian stationary processes. The essence of the present formulation is to replace the Wiener measure by a more general non-Gaussian, multidimensional Lévy measure. The catch, however, is that we shall not work with measures but rather with generalized functions and generalized stochastic processes. These are easier to manipulate in the Fourier domain and better suited for specifying general linear transformations. In other words, we shall rewrite (1.1) as

$$s(x) = \int_{\mathbb{R}} h(x, x') \, w(x') \, \mathrm{d}x' \qquad (1.2)$$

where the entity $w$ (the continuous-domain innovation) needs to be given a proper mathematical interpretation. The main advantage of working with innovations is that they provide a very direct link with the theory of linear systems, which allows for the use of standard engineering notions such as the impulse and frequency responses of a system.
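The construction above can be sketched numerically. The following minimal example (not from the book; the grid size, seed, and the exponential kernel are illustrative choices) discretizes (1.1)-(1.2): the indicator kernel integrates the Gaussian innovation into the Wiener process, while a convolution kernel produces a Gaussian stationary process.

```python
import numpy as np

# Discretize (1.1)-(1.2) on a grid: s[i] = sum_j h(x_i, x_j) * w[j] * dx,
# with w a discrete surrogate for white noise (variance 1/dx per sample).
rng = np.random.default_rng(0)
n, dx = 2000, 1e-3
x = np.arange(n) * dx
w = rng.standard_normal(n) / np.sqrt(dx)   # Gaussian innovation

# h(x, x') = 1_{x > x' >= 0}: integrating the innovation yields the Wiener
# process (Brownian motion), the special case discussed below (1.1).
W = np.cumsum(w) * dx

# h(x, x') = h(x - x'), a convolution kernel, yields a Gaussian *stationary*
# process; the exponential kernel here is an arbitrary illustrative choice.
h = np.exp(-x / 0.05)
s = np.convolve(w, h)[:n] * dx
```

Replacing `rng.standard_normal` by a non-Gaussian innovation is precisely the generalization that the rest of the chapter develops.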

1.3 From splines to stochastic processes, or when Schoenberg meets Lévy

We shall start our journey by making an interesting connection between splines, which are deterministic objects with some inherent sparsity, and Lévy processes, with a special focus on the compound Poisson process, which constitutes the archetype of a sparse stochastic process. The key observation is that both categories of signals (namely, deterministic and random) are ruled by the same differential equation. They can be generated via the proper integration of an "innovation" signal that carries all the necessary information. The fun is that the underlying differential system is only marginally stable, which requires the design of a special anti-derivative operator. We then use the close relationship between splines and wavelets to gain insight on the ability of wavelets to provide sparse representations of such signals. Specifically, we shall see that most non-Gaussian Lévy processes admit a better M-term representation in the Haar wavelet basis than in the classical Karhunen-Loève transform (KLT), which is usually believed to be optimal for data compression. The explanation for this counter-intuitive result is that we are breaking some of the assumptions that are implicit in the proof of optimality of the KLT.

1.3.1 Splines and Legos revisited

Splines constitute a general framework for converting series of data points (or samples) into continuously-defined signals or functions. By extension, they also provide a powerful mechanism for translating tasks that are specified in the continuous domain into efficient numerical algorithms (discretization).


Figure 1.1: Examples of spline signals. (a) Cardinal spline interpolant of degree 0 (piecewise-constant). (b) Cardinal spline interpolant of degree 1 (piecewise-linear). (c) Nonuniform D-spline or compound Poisson process, depending on the interpretation (deterministic vs. stochastic).

The cardinal setting corresponds to the configuration where the sampling grid is on the integers. Given a sequence of sample values $f[k], k \in \mathbb{Z}$, the basic cardinal interpolation problem is to construct a continuously-defined signal $f(x), x \in \mathbb{R}$, that satisfies the interpolation condition $f(x)\big|_{x=k} = f[k]$ for all $k \in \mathbb{Z}$. Since the general problem is obviously ill-posed, the solution is constrained to live in a suitable reconstruction subspace (e.g., a particular space of cardinal splines) whose degrees of freedom are in one-to-one correspondence with the data points. The most basic concretization of those ideas is the construction of the piecewise-constant interpolant

$$f_1(x) = \sum_{k \in \mathbb{Z}} f[k] \, \beta^0_+(x - k) \qquad (1.3)$$

which involves rectangular basis functions (informally described as Legos) that are shifted replicates of the causal 2 B-spline of degree zero

$$\beta^0_+(x) = \begin{cases} 1, & \text{for } 0 \le x < 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (1.4)$$

Observe that the basis functions $\{\beta^0_+(x - k)\}_{k \in \mathbb{Z}}$ are non-overlapping and orthonormal, and that their linear span defines the space of cardinal polynomial splines of degree 0. Moreover, since $\beta^0_+(x)$ takes the value one at the origin and vanishes at all other integers, the expansion coefficients in (1.3) coincide with the original samples of the signal. Equation (1.3) is nothing but a mathematical representation of the sample-and-hold method of interpolation, which yields the type of "Lego-like" signal shown in Fig. 1.1a.
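The sample-and-hold construction (1.3)-(1.4) is short enough to code directly. The sketch below uses a handful of hypothetical sample values to check the interpolation property numerically.

```python
import numpy as np

def beta0(x):
    """Causal B-spline of degree 0: equals 1 on [0, 1), 0 elsewhere (Eq. 1.4)."""
    return np.where((x >= 0) & (x < 1), 1.0, 0.0)

def f1(x, samples):
    """Eq. (1.3): piecewise-constant interpolant f1(x) = sum_k f[k] beta0(x - k)."""
    return sum(fk * beta0(x - k) for k, fk in enumerate(samples))

samples = [2.0, -1.0, 0.5, 3.0]          # hypothetical data f[0..3]
x = np.linspace(0, 3.999, 400)
y = f1(x, samples)

# The coefficients coincide with the samples: f1(k) = f[k] at every integer k.
assert all(f1(np.array([k]), samples)[0] == samples[k] for k in range(4))
```

Because the shifted boxes do not overlap, each sample value is held constant over one unit interval, which is exactly the "Lego" picture of Fig. 1.1a.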

A defining property of piecewise-constant signals is that they exhibit "sparse" first-order derivatives that are zero almost everywhere, except at the points of transition where differentiation is only meaningful in the sense of distributions. In the case of the cardinal

2. A function f+(x) is said to be causal if f+(x) = 0, for all x < 0.


Figure 1.2: Causal polynomial B-splines. (a) Construction of the B-spline of degree 0 starting from the causal Green function of D: $\beta^0_+(x) = \mathbb{1}_+(x) - \mathbb{1}_+(x-1)$. (b) B-splines of degree n = 0, ..., 4, which become more bell-shaped (and beautiful) as n increases.

spline specified by (1.3), we have that

$$\mathrm{D} f_1(x) = \sum_{k \in \mathbb{Z}} a_1[k] \, \delta(x - k) \qquad (1.5)$$

where the weights of the integer-shifted Dirac impulses $\delta(\cdot - k)$ are given by the corresponding jump size of the function: $a_1[k] = f[k] - f[k-1]$. The main point is that the application of the operator $\mathrm{D} = \frac{\mathrm{d}}{\mathrm{d}x}$ uncovers the spline discontinuities (a.k.a. knots), which are located on the integer grid: Its effect is that of a mathematical A-to-D conversion, since the r.h.s. of (1.5) corresponds to the continuous-domain representation of a discrete signal commonly used in the theory of linear systems. In the nomenclature of splines, we say that $f_1(x)$ is a cardinal D-spline 3, which is a special case of a general non-uniform D-spline where the knots can be located arbitrarily (cf. Fig. 1.1c).

The next fundamental observation is that the expansion coefficients in (1.5) are obtained via a finite-difference scheme, which is the discrete counterpart of differentiation. To get some further insight, we define the finite-difference operator

$$\mathrm{D}_d f(x) = f(x) - f(x - 1).$$

The latter turns out to be a smoothed version of the derivative,

$$\mathrm{D}_d f(x) = (\beta^0_+ * \mathrm{D} f)(x),$$

where the smoothing kernel is precisely the B-spline generator for the expansion (1.3). An equivalent manifestation of this property can be found in the relation

$$\beta^0_+(x) = \mathrm{D}_d \mathrm{D}^{-1} \delta(x) = \mathrm{D}_d \mathbb{1}_+(x) \qquad (1.6)$$

where the unit step $\mathbb{1}_+(x) = \mathbb{1}_{[0,+\infty)}(x)$ (a.k.a. the Heaviside function) is the causal Green function 4 of the derivative operator. This formula is illustrated in Fig. 1.2a. Its Fourier-domain counterpart is

$$\hat{\beta}^0_+(\omega) = \int_{\mathbb{R}} \beta^0_+(x) \, \mathrm{e}^{-\mathrm{j}\omega x} \, \mathrm{d}x = \frac{1 - \mathrm{e}^{-\mathrm{j}\omega}}{\mathrm{j}\omega} \qquad (1.7)$$

which is recognized as being the ratio of the frequency responses of the operators $\mathrm{D}_d$ and $\mathrm{D}$, respectively.
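The smoothing relation $\mathrm{D}_d f = (\beta^0_+ * \mathrm{D} f)$ is easy to verify numerically: the finite difference equals the derivative averaged over one unit interval. The sketch below checks this with sine as an arbitrary smooth test function.

```python
import numpy as np

# Numerical check of D_d f = (beta0_+ * Df): the finite difference
# f(x) - f(x-1) equals the integral of f'(x - t) for t in [0, 1).
f, df = np.sin, np.cos           # smooth test function and its exact derivative

x = 2.7                          # arbitrary evaluation point
lhs = f(x) - f(x - 1)            # D_d f(x)

# Midpoint-rule quadrature of the convolution integral over [0, 1)
N = 100000
t = (np.arange(N) + 0.5) / N
rhs = df(x - t).mean()

assert abs(lhs - rhs) < 1e-6
```

The two sides agree to quadrature accuracy, confirming that the B-spline is exactly the kernel that converts the continuous derivative into its discrete counterpart.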

3. Other brands of splines are defined in the same fashion by replacing the derivative D by some other differential operator, generically denoted by L.

4. We say that $\rho(x)$ is the causal Green function of the shift-invariant operator L if $\rho$ is causal and satisfies $\mathrm{L}\rho = \delta$. This can also be written as $\mathrm{L}^{-1}\delta = \rho$, meaning that $\rho$ is the causal impulse response of the shift-invariant inverse operator $\mathrm{L}^{-1}$.


Thus, the basic Lego component, $\beta^0_+$, is much more than a mere building block: it is also a kernel that characterizes the approximation that is made when replacing a continuous-domain derivative by its discrete counterpart. This idea (and its generalization for other operators) will prove to be one of the key ingredients in our formulation of sparse stochastic processes.

1.3.2 Higher-degree polynomial splines

A slightly more sophisticated model is to select a piecewise-linear reconstruction which admits the similar B-spline expansion

$$f_2(x) = \sum_{k \in \mathbb{Z}} f[k+1] \, \beta^1_+(x - k) \qquad (1.8)$$

where

$$\beta^1_+(x) = (\beta^0_+ * \beta^0_+)(x) = \begin{cases} x, & \text{for } 0 \le x < 1 \\ 2 - x, & \text{for } 1 \le x < 2 \\ 0, & \text{otherwise} \end{cases} \qquad (1.9)$$

is the causal B-spline of degree 1, a triangular function centered at $x = 1$. Note that the use of a causal generator is compensated by the unit shifting of the coefficients in (1.8), which is equivalent to re-centering the basis functions on the sampling locations. The main advantage of $f_2$ in (1.8) over $f_1$ in (1.3) is that the underlying function is now continuous, as illustrated in Fig. 1.1b.

In an analogous manner, one can construct higher-degree spline interpolants that are piecewise-polynomials of degree n by considering B-spline atoms of degree n obtained from the (n+1)-fold convolution of $\beta^0_+(x)$ (cf. Fig. 1.2b). The generic version of such a higher-order spline model is

$$f_{n+1}(x) = \sum_{k \in \mathbb{Z}} c[k] \, \beta^n_+(x - k) \qquad (1.10)$$

with

$$\beta^n_+(x) = \underbrace{(\beta^0_+ * \beta^0_+ * \cdots * \beta^0_+)}_{n+1}(x).$$

The catch, though, is that, for $n > 1$, the expansion coefficients $c[k]$ in (1.10) are not identical to the sample values $f[k]$ anymore. Yet, they are in a one-to-one correspondence with them and can be determined efficiently by solving a linear system of equations that has a convenient band-diagonal Toeplitz structure [Uns99].

The higher-order counterparts of relations (1.7) and (1.6) are

$$\hat{\beta}^n_+(\omega) = \left( \frac{1 - \mathrm{e}^{-\mathrm{j}\omega}}{\mathrm{j}\omega} \right)^{n+1}$$

and

$$\beta^n_+(x) = \mathrm{D}_d^{n+1} \mathrm{D}^{-(n+1)} \delta(x) = \mathrm{D}_d^{n+1} \frac{(x)_+^n}{n!} = \sum_{m=0}^{n+1} (-1)^m \binom{n+1}{m} \frac{(x - m)_+^n}{n!} \qquad (1.11)$$


with $(x)_+ = \max(0, x)$. The latter explicit time-domain formula follows from the fact that the impulse response of the (n+1)-fold integrator (or, equivalently, the causal Green function of $\mathrm{D}^{n+1}$) is the one-sided power function $\mathrm{D}^{-(n+1)}\delta(x) = \frac{x_+^n}{n!}$. This elegant formula is due to Schoenberg, the father of splines [Sch46]. He also proved that the polynomial B-spline of degree n is the shortest cardinal $\mathrm{D}^{n+1}$-spline and that its integer translates form a Riesz basis of such polynomial splines. In particular, he showed that the B-spline representation (1.10) is unique and stable, in the sense that

$$\| f_n \|_{L_2}^2 = \int_{\mathbb{R}} | f_n(x) |^2 \, \mathrm{d}x \le \| c \|_{\ell_2}^2 = \sum_{k \in \mathbb{Z}} \big| c[k] \big|^2.$$

Note that the inequality above becomes an equality for n = 0, since the squared $L_2$-norm of the corresponding piecewise-constant function is easily converted into a sum. This also follows from Parseval's identity, because the B-spline basis $\{\beta^0_+(\cdot - k)\}_{k \in \mathbb{Z}}$ is orthonormal.
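The two constructions of $\beta^n_+$ (repeated convolution versus the explicit one-sided-power formula (1.11)) can be compared numerically. The sketch below does so for degree n = 2 on a fine grid; the grid step and tolerance are illustrative choices.

```python
import numpy as np
from math import comb, factorial

# Build beta^n_+ by (n+1)-fold convolution of the degree-0 B-spline,
# then compare against the explicit formula (1.11).
dx = 1e-3
n = 2                                    # degree (3-fold convolution)
t0 = np.arange(0, 1, dx)
beta = np.where((t0 >= 0) & (t0 < 1), 1.0, 0.0)   # beta0_+ sampled on [0, 1)

b = beta.copy()
for _ in range(n):                       # discrete surrogate for continuous
    b = np.convolve(b, beta) * dx        # convolution; support grows to [0, n+1)

x = np.arange(b.size) * dx
explicit = sum((-1)**m * comb(n + 1, m) * np.maximum(x - m, 0.0)**n
               for m in range(n + 2)) / factorial(n)

# Agreement up to the O(dx) discretization error of the convolution
assert np.max(np.abs(b - explicit)) < 0.01
```

Raising `n` shows the bell-shaping of Fig. 1.2b; the convolved and explicit forms keep agreeing up to the discretization error.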

One last feature is that polynomial splines of degree n are inherently smooth, in the sense that they are n-times differentiable everywhere with bounded derivatives, that is, Hölder-continuous of order n. In the cardinal setting, this follows from the property that

$$\mathrm{D}^n \beta^n_+(x) = \mathrm{D}^n \mathrm{D}_d^{n+1} \mathrm{D}^{-(n+1)} \delta(x) = \mathrm{D}_d^n \, \mathrm{D}_d \mathrm{D}^{-1} \delta(x) = \mathrm{D}_d^n \beta^0_+(x),$$

which indicates that the nth-order derivative of a B-spline of degree n is piecewise-constant and bounded.

1.3.3 Random splines, innovations, and Lévy processes

To make the link with Lévy processes, we now express the random counterpart of (1.5) as

$$\mathrm{D} s(x) = \sum_n a_n \, \delta(x - x_n) = w(x) \qquad (1.12)$$

where the locations $x_n$ of the Dirac impulses are uniformly distributed over the real line (Poisson distribution with rate parameter $\lambda$) and the weights $a_n$ are i.i.d. with amplitude distribution $p_A(a)$. For simplicity, we also assume that $p_A$ is symmetric with finite variance $\sigma_a^2 = \int_{\mathbb{R}} a^2 p_A(a) \, \mathrm{d}a$. We shall refer to $w$ as the innovation of the signal $s$, since it contains all the parameters that are necessary for its description. Clearly, $s$ is a signal with a finite rate of innovation, a term that was coined by Vetterli et al. [VMB02].

The idea now is to reconstruct $s$ from its innovation $w$ by integrating (1.12). This requires the specification of some boundary condition to fix the integration constant. Since the constraint in the definition of Lévy processes is $s(0) = 0$ (with probability one), we first need to find a suitable antiderivative operator, which we shall denote by $\mathrm{D}_0^{-1}$. In the event when the input function is Lebesgue-integrable, the relevant operator is readily specified as

$$\mathrm{D}_0^{-1}\varphi(x) = \int_{-\infty}^{x} \varphi(\tau) \, \mathrm{d}\tau - \int_{-\infty}^{0} \varphi(\tau) \, \mathrm{d}\tau = \begin{cases} \displaystyle \int_0^x \varphi(\tau) \, \mathrm{d}\tau, & \text{for } x \ge 0 \\ \displaystyle -\int_x^0 \varphi(\tau) \, \mathrm{d}\tau, & \text{for } x < 0. \end{cases}$$

It is the corrected version (subtraction of the proper signal-dependent constant) of the conventional shift-invariant integrator $\mathrm{D}^{-1}$, for which the integral runs from $-\infty$ to $x$. The Fourier counterpart of this definition is

$$\mathrm{D}_0^{-1}\varphi(x) = \int_{\mathbb{R}} \frac{\mathrm{e}^{\mathrm{j}\omega x} - 1}{\mathrm{j}\omega} \, \hat{\varphi}(\omega) \, \frac{\mathrm{d}\omega}{2\pi}$$


which can be extended, by duality, to a much larger class of generalized functions (cf. Chapter 5). This is feasible because the latter expression is a regularized version of an integral that would otherwise be singular, since the division by $\mathrm{j}\omega$ is tempered by a proper correction in the numerator: $\mathrm{e}^{\mathrm{j}\omega x} - 1 = \mathrm{j}\omega x + O(\omega^2)$. It is important to note that $\mathrm{D}_0^{-1}$ is scale-invariant (in the sense that it commutes with scaling), but not shift-invariant, unlike $\mathrm{D}^{-1}$. Our reason for selecting $\mathrm{D}_0^{-1}$ over $\mathrm{D}^{-1}$ is actually more fundamental than just imposing the "right" boundary conditions. It is guided by stability considerations: $\mathrm{D}_0^{-1}$ is a valid right inverse of $\mathrm{D}$, in the sense that $\mathrm{D}\mathrm{D}_0^{-1} = \mathrm{Id}$ over a large class of generalized functions, while the use of the shift-invariant inverse $\mathrm{D}^{-1}$ is much more constrained. Other than that, both operators share most of their global properties. In particular, since the finite-difference operator has the convenient property of annihilating the constants that are in the null space of $\mathrm{D}$, we see that

$$\beta^0_+(x) = \mathrm{D}_d \mathrm{D}_0^{-1} \delta(x) = \mathrm{D}_d \mathrm{D}^{-1} \delta(x). \qquad (1.13)$$

Having the proper inverse operator at our disposal, we can apply it to formally solve the stochastic differential equation (1.12). This yields the explicit representation of the sparse stochastic process:

$$s(x)=\mathrm{D}_0^{-1}w(x)=\sum_{n}a_n\,\mathrm{D}_0^{-1}\{\delta(\cdot-x_n)\}(x)=\sum_{n}a_n\bigl(\mathbb{1}_+(x-x_n)-\mathbb{1}_+(-x_n)\bigr) \qquad (1.14)$$

where the second term $\mathbb{1}_+(-x_n)$ in the last parenthesis ensures that $s(0)=0$. Clearly, the signal defined by (1.14) is piecewise-constant (a random spline of degree 0) and its construction is compatible with the classical definition of a compound Poisson process, which is a special type of Lévy process. A representative example is shown in Fig. 1.1c.

It can be shown that the innovation $w$ specified by (1.12), made of random impulses, is a special type of continuous-domain white noise with the property that

$$\mathrm{E}\{w(x)w(x')\}=R_w(x-x')=\sigma_0^2\,\delta(x-x') \qquad (1.15)$$

where $\sigma_0^2=\lambda\sigma_a^2$ is the product of the Poisson rate parameter $\lambda$ and the variance $\sigma_a^2$ of the amplitude distribution. More generally, we can determine the correlation form of the innovation, which is given by

$$\mathrm{E}\{\langle\varphi_1,w\rangle\langle\varphi_2,w\rangle\}=\sigma_0^2\,\langle\varphi_1,\varphi_2\rangle \qquad (1.16)$$

for any real-valued functions $\varphi_1,\varphi_2\in L_2(\mathbb{R})$, with $\langle\varphi_1,\varphi_2\rangle=\int_{\mathbb{R}}\varphi_1(x)\varphi_2(x)\,\mathrm{d}x$.

This suggests that we can apply the same operator-based synthesis to other types of continuous-domain white noise, as illustrated in Fig. 1.3. In doing so, we are able to generate the whole family of Lévy processes. In the case where $w$ is a white Gaussian noise, the resulting signal is a Brownian motion, whose sample paths are continuous (almost surely). A more extreme case arises when $w$ is an alpha-stable noise, which yields a stable Lévy process whose sample path exhibits a few really large jumps and is rougher than a Brownian motion.
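On a fine grid, the operator-based synthesis reduces to a cumulative sum of discretized white-noise increments, which gives a quick way to generate the three brands of Fig. 1.3. A minimal sketch (the grid size, jump rate, and scalings are illustrative assumptions):

```python
import numpy as np

def levy_path(increment_sampler, n=4096, seed=0):
    """Approximate s = D0^{-1} w on [0, 1] by cumulating i.i.d. increments
    of the discretized innovation, with the boundary condition s(0) = 0."""
    rng = np.random.default_rng(seed)
    dw = increment_sampler(rng, n)
    return np.concatenate(([0.0], np.cumsum(dw)))

n = 4096
# Gaussian innovation -> Brownian motion
brownian = levy_path(lambda r, m: r.normal(0.0, 1.0 / np.sqrt(m), m), n)
# sparse impulsive innovation (about 10 jumps on average) -> compound Poisson
poisson = levy_path(lambda r, m: r.normal(0.0, 1.0, m) * (r.random(m) < 10.0 / m), n)
# SaS innovation with alpha = 1 -> Levy flight with Cauchy increments
cauchy = levy_path(lambda r, m: r.standard_cauchy(m) / m, n)
```

The three samplers differ only in the distribution of the i.i.d. increments, which is precisely the point of the innovation model.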

In the classical literature on stochastic processes, Lévy processes are usually defined in terms of their increments, which are i.i.d. and infinitely-divisible random variables (cf. Chapter 7). Here, we shall consider the so-called increment process $u(x)=s(x)-s(x-1)$, which has a number of remarkable properties. The key observation is that $u$, in its continuous-domain version, is the convolution of a white noise (innovation) with the B-spline kernel $\beta_+^0$. Indeed, the relation (1.13) leads to

$$u(x)=\mathrm{D_d}s(x)=\mathrm{D_d}\mathrm{D}_0^{-1}w(x)=(\beta_+^0 * w)(x). \qquad (1.17)$$


1. INTRODUCTION

[Figure 1.3 diagram: a white noise (innovation) $w(t)$ is fed through an integrator to produce a Lévy process $s(t)$; the three rows show Gaussian, impulsive, and S$\alpha$S (Cauchy) excitations yielding Brownian motion, a compound Poisson process, and a Lévy flight, respectively.]

Figure 1.3: Synthesis of different brands of Lévy processes by integration of a corresponding continuous-domain white noise. The alpha-stable excitation in the bottom example is such that the increments of the Lévy process have a symmetric Cauchy distribution.

This implies, among other things, that $u$ is stationary, while the original Lévy process $s$ is not (since $\mathrm{D}_0^{-1}$ is not shift-invariant). It also suggests that the samples of the increment process $u$ are independent if they are taken at a distance of 1 or more apart, the limit corresponding to the support of the rectangular convolution kernel $\beta_+^0$. When the autocorrelation function $R_w(y)$ of the driving noise is well-defined and given by (1.15), we can easily determine the autocorrelation of $u$ as

$$R_u(y)=\mathrm{E}\{u(x)u(x+y)\}=\bigl(\beta_+^0 * (\beta_+^0)^{\vee} * R_w\bigr)(y)=\sigma_0^2\,\beta_+^1(y+1) \qquad (1.18)$$

where $(\beta_+^0)^{\vee}(x)=\beta_+^0(-x)$. It is proportional to the autocorrelation of a rectangle, which is a triangular function (a centered B-spline of degree 1).

Of special interest to us are the samples of $u$ on the integer grid, which are characterized for $k\in\mathbb{Z}$ as

$$u[k]=s(k)-s(k-1)=\langle w,\beta_+^0(k-\cdot)\rangle.$$

The right-hand-side relation can be used to show that the $u[k]$ are i.i.d., because $w$ is white and stationary and the supports of the analysis functions $\beta_+^0(k-\cdot)$ are non-overlapping. We shall refer to $\{u[k]\}_{k\in\mathbb{Z}}$ as the discrete innovation of $s$. Its determination involves the sampling of $s$ at the integers and a discrete differentiation (finite differences), in direct analogy with the generation of the continuous-domain innovation $w(x)=\mathrm{D}s(x)$.

The discrete innovation sequence $u[\cdot]$ will play a fundamental role in signal processing because it constitutes a convenient tool for extracting the statistics and characterizing the samples of a stochastic process. It is probably the best practical way of presenting the information because

1. we never have access to the full signal $s(x)$, which is a continuously-defined entity, and

2. we cannot implement the whitening operator (derivative) exactly, not to mention that the continuous-domain innovation $w(x)$ does not admit an interpretation as an ordinary function of $x$. For instance, Brownian motion is nowhere differentiable in the classical sense.

This points to the fact that the continuous-domain innovation model is a theoretical construct. Its primary purpose is to facilitate the determination of the joint probability distributions of any series of linear measurements of a wide class of sparse stochastic processes, including the discrete version of the innovation, which has the property of being maximally decoupled.

Figure 1.4: Dual pair of multiresolution bases, shown for scales $i=0,1,2$, where the first kind of functions (wavelets) are the derivatives of the second (hierarchical basis functions): (a) (unnormalized) Haar wavelet basis $\{\psi_{i,k}\}$. (b) Faber-Schauder basis $\{\phi_{i,k}\}$ (a.k.a. Franklin system).

1.3.4 Wavelet analysis of Lévy processes and M-term approximations

Our purpose so far has been to link splines and Lévy processes to the derivative operator $\mathrm{D}$. We shall now exploit this connection in the context of wavelet analysis. To that end, we consider the Haar basis $\{\psi_{i,k}\}_{i\in\mathbb{Z},k\in\mathbb{Z}}$, which is generated by the Haar wavelet

$$\psi_{\mathrm{Haar}}(x)=\begin{cases}1, & \text{for } 0\le x<\tfrac{1}{2}\\ -1, & \text{for } \tfrac{1}{2}\le x<1\\ 0, & \text{otherwise.}\end{cases} \qquad (1.19)$$

The basis functions, which are orthonormal, are given by

$$\psi_{i,k}(x)=2^{-i/2}\,\psi_{\mathrm{Haar}}\!\left(\frac{x-2^i k}{2^i}\right) \qquad (1.20)$$

where $i$ and $k$ are the scale (dilation of $\psi_{\mathrm{Haar}}$ by $2^i$) and location (translation of $\psi_{i,0}$ by $2^i k$) indices, respectively. A closely related system is the Faber-Schauder basis $\{\phi_{i,k}\}_{i\in\mathbb{Z},k\in\mathbb{Z}}$, which is made up of B-splines of degree 1 in a wavelet-like configuration (cf. Fig. 1.4).

Specifically, the hierarchical triangle basis functions are given by

$$\phi_{i,k}(x)=\beta_+^1\!\left(\frac{x-2^i k}{2^{i-1}}\right). \qquad (1.21)$$

While these functions are orthogonal within any given scale (because they are non-overlapping), they fail to be so across scales. Yet, they form a Schauder basis, which is a somewhat weaker property than being a Riesz basis of $L_2(\mathbb{R})$.

The fundamental observation for our purpose is that the Haar system can be obtained by differentiating the Faber-Schauder one, up to some amplitude factor. Specifically, we have the relations

$$\psi_{i,k}=2^{i/2-1}\,\mathrm{D}\phi_{i,k} \qquad (1.22)$$
$$\mathrm{D}_0^{-1}\psi_{i,k}=2^{i/2-1}\,\phi_{i,k}. \qquad (1.23)$$
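Relation (1.22) is easy to verify numerically: at scale $i=0$, the triangle $\phi_{0,0}(x)=\beta^1_+(2x)$ has slope $\pm 2$ on the two halves of $[0,1]$, and halving its derivative reproduces the Haar wavelet. A quick finite-difference check (the grid is chosen to avoid the kinks, where the derivative is undefined):

```python
import numpy as np

def haar(x):
    """Haar wavelet (1.19)."""
    return (np.where((x >= 0) & (x < 0.5), 1.0, 0.0)
            - np.where((x >= 0.5) & (x < 1.0), 1.0, 0.0))

def beta1(x):
    """Causal B-spline of degree 1: triangle supported on [0, 2], peak at 1."""
    return np.maximum(1.0 - np.abs(x - 1.0), 0.0)

def phi(x, i=0, k=0):
    """Faber-Schauder function (1.21): beta1((x - 2^i k) / 2^(i-1))."""
    return beta1((x - 2.0**i * k) / 2.0**(i - 1))

h = 1e-6
x = np.linspace(0.05, 0.95, 181)
x = x[np.abs(x - 0.5) > 0.02]               # stay away from the kink at 1/2
dphi = (phi(x + h) - phi(x - h)) / (2 * h)  # centered finite difference of phi_{0,0}
lhs = haar(x)
rhs = 2.0**(0 / 2 - 1) * dphi               # relation (1.22) at scale i = 0, k = 0
```

The same check works at any scale and location, since (1.22) holds for all $(i,k)$.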

Let us now apply (1.22) to the formal determination of the wavelet coefficients of the Lévy process $s=\mathrm{D}_0^{-1}w$. The crucial manipulation, which will be justified rigorously within the framework of generalized stochastic processes (cf. Chapter 3), is $\langle s,\mathrm{D}\phi_{i,k}\rangle=\langle \mathrm{D}^{*}s,\phi_{i,k}\rangle=-\langle w,\phi_{i,k}\rangle$, where we have used the adjoint relation $\mathrm{D}^{*}=-\mathrm{D}$ and the right-inverse property of $\mathrm{D}_0^{-1}$. This allows us to express the wavelet coefficients as

$$W_{i,k}=\langle s,\psi_{i,k}\rangle=-2^{i/2-1}\langle w,\phi_{i,k}\rangle$$

which, up to some scaling factors, amounts to a Faber-Schauder analysis of the innovation $w=\mathrm{D}s$. Since the triangle functions $\phi_{i,k}$ are non-overlapping within a given scale and the innovation is independent at every point, we immediately deduce that the corresponding wavelet coefficients are also independent. However, the decoupling is not perfect across scales due to the parent-to-child overlap of the triangle functions. The residual correlation can be determined from the correlation form (1.16) of the noise, according to

$$\mathrm{E}\{W_{i,k}W_{i',k'}\}=2^{(i+i')/2-2}\,\mathrm{E}\bigl\{\langle w,\phi_{i,k}\rangle\langle w,\phi_{i',k'}\rangle\bigr\}\propto\langle\phi_{i,k},\phi_{i',k'}\rangle.$$

Since the triangle functions are non-negative, the residual correlation is zero if and only if $\phi_{i,k}$ and $\phi_{i',k'}$ are non-overlapping, in which case the wavelet coefficients are independent as well. We can also predict that the wavelet transform of a compound Poisson process will be sparse (i.e., with many vanishing coefficients), because the random Dirac impulses of the innovation will intersect only few Faber-Schauder functions, an effect that becomes more and more pronounced as the scale gets finer. The level of sparsity can therefore be expected to depend directly on $\lambda$ (the density of impulses per unit length).

To quantify this behavior, we applied Haar wavelets to the compression of sampled real-izations of Lévy processes and compared the results with those of the “optimal” textbooksolution for transform coding. In the case of a Lévy process with finite variance, theKarhunen-Loève transform (KLT) can be determined analytically from the knowledge ofthe covariance function E{s(x)s(y)} =C

°

|x|+ |y |° |x ° y |¢

where C is an appropriate con-stant. The KLT is also known to converge to the discrete cosine transform (DCT) as the sizeof the signal increases. The present compression task is to reconstruct a series of 4096-point signals from their M largest transform coefficients, which is the minimum-error se-lection rule dictated by Parseval’s relation. Fig. 1.5 displays the graph of the relative quad-ratic M-term approximation errors for the three types of Lévy processes shown in Fig. 1.3.We also considered the identity transform as baseline, and the DCT as well, whose resultswere found to be indistinguishable from those of the KLT. We observe that the KLT per-forms best in the Gaussian scenario, as expected. It is also slightly better than wavelets atlarge compression ratios for the compound Poisson process (piecewise-constant signalwith Gaussian-distributed jumps). In the latter case, however, the situation changes dra-matically as M increases since one is able to reconstruct the signal perfectly from a frac-tion of the wavelet coefficients, in reason of the sparse behavior explained above. The ad-vantage of wavelets over the KLT/DCT is striking for the Lévy flight (SÆS distribution withÆ= 1). While these findings are surprising at first, they do not contradict the classical the-ory which tells us that the KLT has the minimum basis-restriction error for the given classof processes. The twist here is that the selection of the M largest transform coefficientsamounts to some adaptive reordering of the basis functions, which is not accounted forin the derivation of the KLT. The other point is that the KLT solution is not defined for thethird type of SÆS process whose theoretical covariances are unbounded—this does not


Figure 1.5: Haar wavelets vs. KLT: $M$-term approximation errors for different brands of Lévy processes; each panel plots the relative $\ell_2^2$ error versus $M$ for signals of length $2^{12}$. (a) Gaussian (Brownian motion). (b) Compound Poisson with Gaussian jump distribution and $\mathrm{e}^{-\lambda}=0.9$. (c) Alpha-stable (symmetric Cauchy). The results are averages over 1000 realizations.

prevent us from applying the Gaussian solution/DCT to a finite-length realization whose $\ell_2$-norm is finite (almost surely). This simple experiment with various stochastic models corroborates the results obtained in image compression, where the superiority of wavelets over the DCT (e.g., JPEG2000 vs. JPEG) is well-established.
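The experiment can be reproduced in outline with a few lines of code: expand a sampled realization in an orthonormal Haar basis and in the DCT, keep the $M$ largest-magnitude coefficients, and use Parseval's relation to read off the relative $\ell_2$ error without forming the reconstruction. A sketch for the compound Poisson case (the jump probability, signal length, and $M$ are illustrative, and the DCT is computed with the standard FFT reordering trick):

```python
import numpy as np

def haar_transform(x):
    """Orthonormal Haar transform of a length-2^J signal."""
    c = np.asarray(x, dtype=float)
    pieces = []
    while c.size > 1:
        pieces.append((c[0::2] - c[1::2]) / np.sqrt(2.0))  # detail coefficients
        c = (c[0::2] + c[1::2]) / np.sqrt(2.0)             # coarse approximation
    pieces.append(c)
    return np.concatenate(pieces[::-1])

def dct_ortho(x):
    """Orthonormal DCT-II via the even/odd FFT reordering trick."""
    N = x.size
    v = np.concatenate((x[0::2], x[1::2][::-1]))
    y = (np.fft.fft(v) * np.exp(-1j * np.pi * np.arange(N) / (2 * N))).real
    y[0] /= np.sqrt(N)
    y[1:] *= np.sqrt(2.0 / N)
    return y

def m_term_error(coeffs, M):
    """Relative l2 error after keeping the M largest-magnitude coefficients
    of an orthonormal transform (Parseval makes reconstruction unnecessary)."""
    e = np.sort(np.abs(coeffs)) ** 2
    return np.sqrt(e[:-M].sum() / e.sum())

rng = np.random.default_rng(0)
n, M = 4096, 256
jumps = rng.normal(0.0, 1.0, n) * (rng.random(n) < 0.01)  # sparse Gaussian jumps
s = np.cumsum(jumps)                                      # compound Poisson path
err_haar = m_term_error(haar_transform(s), M)
err_dct = m_term_error(dct_ortho(s), M)
```

Since both transforms are orthonormal, comparing `err_haar` and `err_dct` over many realizations reproduces the qualitative behavior of Fig. 1.5.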

1.3.5 Lévy’s wavelet-based synthesis of Brownian motion

We close this introductory chapter by making the connection with a multiresolution scheme that Paul Lévy developed in the 1930s to characterize the properties of Brownian motion. To do so, we adopt a point of view that is the dual of the one in Section 1.3.4: it essentially amounts to interchanging the analysis and synthesis functions. As a first step, we expand the innovation $w$ in the orthonormal Haar basis and obtain

$$w=\sum_{i\in\mathbb{Z}}\sum_{k\in\mathbb{Z}}Z_{i,k}\,\psi_{i,k}\quad\text{with}\quad Z_{i,k}=\langle w,\psi_{i,k}\rangle.$$

This is acceptable⁵ under the finite-variance hypothesis on $w$. Since the Haar basis is orthogonal, the coefficients $Z_{i,k}$ in the above expansion are fully decorrelated, but not necessarily independent, unless the white noise is Gaussian or the corresponding basis functions do not overlap.

5. The convergence in the sense of distributions is ensured since the wavelet coefficients of a rapidly-decaying test function $\varphi$ are rapidly-decaying as well.

We then construct the Lévy process $s=\mathrm{D}_0^{-1}w$ by integrating the wavelet expansion of the innovation, which yields

$$s(x)=\sum_{i\in\mathbb{Z}}\sum_{k\in\mathbb{Z}}Z_{i,k}\,\mathrm{D}_0^{-1}\psi_{i,k}(x)=\sum_{i\in\mathbb{Z}}\sum_{k\in\mathbb{Z}}2^{i/2-1}Z_{i,k}\,\phi_{i,k}(x). \qquad (1.24)$$

The representation (1.24) is of special interest when the noise is Gaussian, in which case the coefficients $Z_{i,k}$ are i.i.d. and follow a standardized Gaussian distribution. The formula then maps into Lévy's recursive mid-point method of synthesizing Brownian motion, which Yves Meyer singles out as the first use of wavelets to be found in the literature. The Faber-Schauder expansion (1.24) stands out as a localized, practical alternative to Wiener's original construction of Brownian motion, which involves a sum of harmonic cosines (a KLT-type expansion).
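In the Gaussian case, truncating (1.24) to the dyadic scales that refine $[0,1]$, together with a coarse linear term carrying $B(1)$, gives Lévy's midpoint construction directly. A sketch of this synthesis (the truncation depth $J$ and the handling of all coarser scales as a single linear term are simplifying assumptions):

```python
import numpy as np

def beta1(x):
    """Causal B-spline of degree 1: triangle on [0, 2] with peak at 1."""
    return np.maximum(1.0 - np.abs(x - 1.0), 0.0)

def brownian_faber_schauder(x, J=10, seed=0):
    """Levy's midpoint synthesis of Brownian motion on [0, 1]: truncation of
    (1.24) to scales i = 0, -1, ..., -(J-1) with i.i.d. standard Gaussian
    Z_{i,k}, plus the coarse linear term x * B(1)."""
    rng = np.random.default_rng(seed)
    s = x * rng.normal()                 # draws B(1) and pins B(0) = 0
    for i in range(0, -J, -1):           # finer and finer dyadic scales
        k = np.arange(2.0 ** (-i))
        Z = rng.normal(size=k.size)
        # phi_{i,k}(x) = beta1((x - 2^i k) / 2^(i-1)), weighted per (1.24)
        s = s + (2.0 ** (i / 2 - 1) * Z[None, :]
                 * beta1((x[:, None] - 2.0 ** i * k[None, :]) / 2.0 ** (i - 1))
                 ).sum(axis=1)
    return s

x = np.linspace(0.0, 1.0, 513)
b = brownian_faber_schauder(x)
```

At each scale, the weight $2^{i/2-1}$ gives the midpoint correction a variance of $2^{i}/4$, which is exactly what Lévy's recursive refinement prescribes.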

1.4 Historical notes


Chapter 2

Road map to the monograph

The writing of this monograph was motivated by our desire to formalize and extend the ideas presented in Section 1.3 to a class of differential operators much broader than the derivative $\mathrm{D}$. Concretely, this translates into the investigation of the family of stochastic processes specified by the general innovation model that is summarized in Fig. 2.1. The corresponding generator of random signals (upper part of the diagram) has two fundamental components: (1) a continuous-domain noise excitation $w$, which may be thought of as being composed of a continuum of i.i.d. random atoms (innovations), and (2) a deterministic mixing procedure (formally described by $\mathrm{L}^{-1}$) which couples the random contributions and imposes the correlation structure of the output. The concise description of the model is $\mathrm{L}s=w$, where $\mathrm{L}$ is the whitening operator. The term "innovation" refers to the fact that $w$ represents the unpredictable part of the process. When the inverse operator $\mathrm{L}^{-1}$ is linear shift-invariant (LSI), the signal generator reduces to a simple convolutional system which is characterized by its impulse response (or, equivalently, its frequency response). Innovation modeling has a long tradition in statistical communication theory and signal processing; it is the basis for the interpretation of a Gaussian stationary process as a filtered version of a white Gaussian noise [Kai70, Pap91].

In the present context, the underlying objects are continuously-defined. The innovation model then results from defining a stochastic process (or random field when the index variable $r$ is a vector in $\mathbb{R}^d$) as the solution of a stochastic differential equation (SDE) driven by a particular brand of noise. The nonstandard aspect here is that we are considering the innovation model in its greatest generality, allowing for non-Gaussian inputs and differential systems that are not necessarily stable. We shall argue that these extensions are essential for making this type of modeling compatible with the latest developments in signal processing pertaining to the use of wavelets and sparsity-promoting reconstruction algorithms. Specifically, we shall see that it is possible to generate a wide variety of sparse processes by replacing the traditional Gaussian input by some more general brand of (Lévy) noise, within the limits of mathematical admissibility [UTAKed, UTSed]. We shall also demonstrate that such processes admit a sparse representation in a wavelet basis under the assumption that $\mathrm{L}$ is scale-invariant. The difficulty there is that scale-invariant SDEs are inherently unstable (due to the presence of poles at the origin); yet, we shall see that they can still result in a proper specification of fractal-type processes, albeit not within the usual framework of stationary processes [TVDVU09, UT11]. The nontrivial aspect of these generalizations is that they necessitate the resolution of instabilities—in the form of singular integrals. This is required not only at the system level, to allow for non-stationary processes, but also at the stochastic level, because the most interesting sparsity patterns are associated with unbounded Lévy measures.


[Figure 2.1 diagram: white Lévy noise $w(r)$ is passed through the shaping filter $\mathrm{L}^{-1}$ (with appropriate boundary conditions) to produce the generalized stochastic process $s(r)$; the whitening operator $\mathrm{L}$ maps $s$ back to $w$.]

Figure 2.1: Innovation model of a generalized stochastic process. The process is generated by application of the (linear) inverse operator $\mathrm{L}^{-1}$ to a continuous-domain white-noise process $w$. The generation mechanism is general in the sense that it applies to the complete family of Lévy noises, including Gaussian noise as the most basic (non-sparse) excitation. The output process $s$ is stationary if and only if $\mathrm{L}^{-1}$ is shift-invariant.

Before proceeding with the statistical characterization of sparse stochastic processes, we shall highlight the central role of the operator $\mathrm{L}$ and make a connection with spline theory and the construction of signal-adapted wavelet bases.

2.1 On the implications of the innovation model

To motivate our approach, we start with an informal discussion, leaving the technicalities aside. The stochastic process $s$ in Fig. 2.1 is constructed by applying the (integral) operator $\mathrm{L}^{-1}$ to some continuous-domain white noise $w$. In most cases of interest, $\mathrm{L}^{-1}$ has an infinitely-supported impulse response which introduces long-range dependencies. If we are aiming at a concise statistical characterization of $s$, it is essential that we somehow invert this integration process, the ideal being to apply the operator $\mathrm{L}$, which would give back the innovation signal $w$ that is fully decoupled. Unfortunately, this is not feasible in practice because we do not have access to the signal $s(r)$ over the entire domain $r\in\mathbb{R}^d$, but only to its sampled values on a lattice or, more generally, to a series of coefficients in some appropriate basis. Our analysis options are essentially twofold, as described in Sections 2.1.1 and 2.1.2.

2.1.1 Linear combination of sampled values

Given the sampled values $s(k)$, $k\in\mathbb{Z}^d$, the best we can aim at is to implement a discrete version of the operator $\mathrm{L}$, which is denoted by $\mathrm{L_d}$. In effect, $\mathrm{L_d}$ will act on the sampled version of the signal as a digital filter. The corresponding continuous-domain description of its impulse response is

$$\mathrm{L_d}\delta(r)=\sum_{k\in\mathbb{Z}^d}d[k]\,\delta(r-k)$$

with some appropriate weights $d$. To fix ideas, $\mathrm{L_d}$ may correspond to the numerical version of the operator provided by the finite-difference method of approximating derivatives.

The interest is now to characterize the (approximate) decoupling effect of this discrete version of the whitening operator. This is quite feasible when the continuous-domain composition of the operators $\mathrm{L_d}$ and $\mathrm{L}^{-1}$ is shift-invariant with an impulse response $\beta_{\mathrm{L}}(r)$ that is absolutely integrable (BIBO stability). In that case, one readily finds that

$$u(r)=\mathrm{L_d}s(r)=(\beta_{\mathrm{L}} * w)(r) \qquad (2.1)$$

where

$$\beta_{\mathrm{L}}(r)=\mathrm{L_d}\mathrm{L}^{-1}\delta(r). \qquad (2.2)$$

This suggests that the decoupling effect will be the strongest when the convolution kernel $\beta_{\mathrm{L}}$ is the most localized and closest to an impulse.¹ We call $\beta_{\mathrm{L}}$ the generalized B-spline associated with the operator $\mathrm{L}$. For a given operator $\mathrm{L}$, the challenge will be to design the most localized kernel $\beta_{\mathrm{L}}$, which is the way of approaching the discretization problem that best matches our statistical objectives. The good news is that this is a standard problem in spline theory, meaning that we can take advantage of the large body of techniques available in this area, even though they have hardly been applied to the stochastic setting so far.
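As a concrete instance of (2.2) beyond the first-order case of Chapter 1, take $\mathrm{L}=\mathrm{D}^2$: a causal Green's function of $\mathrm{D}^2$ is the one-sided ramp $x_+=\max(x,0)$, and the natural discretization $\mathrm{L_d}$ is the second-order finite difference with weights $(1,-2,1)$. The resulting $\beta_{\mathrm{L}}$ is the compactly supported triangle $\beta^1_+$, which illustrates how the proper finite-difference weights localize an otherwise growing impulse response. A numerical check of this identity:

```python
import numpy as np

def ramp(x):
    """One-sided ramp x_+ : a causal Green's function of L = D^2."""
    return np.maximum(x, 0.0)

x = np.linspace(-1.0, 4.0, 501)
# beta_L = L_d L^{-1} delta with L_d the second-order finite difference (1, -2, 1)
beta_L = ramp(x) - 2.0 * ramp(x - 1.0) + ramp(x - 2.0)
# the unbounded ramp collapses to the triangle B-spline beta^1_+ on [0, 2]
triangle = np.maximum(1.0 - np.abs(x - 1.0), 0.0)
```

For a general admissible $\mathrm{L}$, the same recipe $\beta_{\mathrm{L}}=\mathrm{L_d}\mathrm{L}^{-1}\delta$ applies, with $\mathrm{L_d}$ a discrete operator that annihilates the null space of $\mathrm{L}$.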

2.1.2 Wavelet analysis

The second option is to analyze the signal $s$ using wavelet-like functions $\{\psi_i(\cdot-r_k)\}$. For that purpose, we assume that we have at our disposal some real-valued "L-compatible" generalized wavelets which, at a given resolution level $i$, are such that

$$\psi_i(r)=\mathrm{L}^{*}\phi_i(r). \qquad (2.3)$$

Here, $\mathrm{L}^{*}$ is the adjoint operator of $\mathrm{L}$ and $\phi_i$ is some smoothing kernel with good localization properties. The interpretation is that the wavelet transform provides some kind of multiresolution version of the operator $\mathrm{L}$, with the effective width of the kernels $\phi_i$ increasing in direct proportion to the scale; typically, $\phi_i(r)\propto\phi_0(r/2^i)$. Then, the wavelet analysis of the stochastic process $s$ reduces to

$$\langle s,\psi_i(\cdot-r_0)\rangle=\langle s,\mathrm{L}^{*}\phi_i(\cdot-r_0)\rangle=\langle \mathrm{L}s,\phi_i(\cdot-r_0)\rangle=\langle w,\phi_i(\cdot-r_0)\rangle=(\phi_i^{\vee} * w)(r_0) \qquad (2.4)$$

where $\phi_i^{\vee}(r)=\phi_i(-r)$ is the reversed version of $\phi_i$. The remarkable aspect is that the effect is essentially the same as in (2.1), so that it makes good sense to develop a common framework to analyze white noise.

This is all nice in principle, as long as one can construct "L-compatible" wavelet bases. For instance, if $\mathrm{L}$ is a pure $n$th-order derivative operator—or, by extension, a scale-invariant differential operator—then the above reasoning is directly applicable to conventional wavelet bases. Indeed, these are known to behave like multiscale versions of derivatives due to their vanishing-moment property [Mey90, Dau92, Mal09]. In prior work, we have linked this behavior, as well as a number of other fundamental wavelet properties, to the polynomial B-spline convolutional factor that is necessarily present in every wavelet that generates a multiresolution basis of $L_2(\mathbb{R})$ [UB03]. What is not so widely known is that the spline connection extends to a much broader variety of operators—not necessarily scale-invariant—and that it also provides a general recipe for constructing wavelet-like basis functions that are matched to some given operator $\mathrm{L}$. This has been demonstrated in 1D for the entire family of ordinary differential operators [KU06]. The only significant difference with the conventional theory of wavelets is that the smoothing kernels $\phi_i$ are not necessarily rescaled versions of each other.

1. One may be tempted to pretend that $\beta_{\mathrm{L}}$ is a Dirac impulse, which amounts to neglecting all discretization effects. Unfortunately, this is incorrect and most likely to result in false statistical conclusions. In fact, we shall see that the localization deteriorates as the order of the operator increases, inducing higher (Markov) orders of dependencies.


Note that the "L-compatible" property is relatively robust. For instance, if $\mathrm{L}=\mathrm{L}'\mathrm{L}_0$, then an "L-compatible" wavelet is also $\mathrm{L}_0$-compatible, with $\phi_i'=(\mathrm{L}')^{*}\phi_i$. The design challenge in the context of stochastic modeling is thus to come up with a suitable wavelet basis such that $\phi_i$ in (2.3) is most localized—possibly, of compact support.

2.2 Organization of the monograph

The reasoning of Section 2.1 is appealing because of its conceptual simplicity and generality. Yet, the precise formulation of the theory requires some special care, because the underlying stochastic objects are infinite-dimensional and possibly highly singular. For instance, we are faced with a major difficulty at the onset, because the continuous-domain input of our model (the innovation $w$) does not admit a conventional interpretation as a function of the domain variable $r$. This entity can only be probed indirectly by forming scalar products with test functions, in accordance with Laurent Schwartz's theory of distributions, so that the use of advanced mathematics is unavoidable.

For the benefit of readers who may not be familiar with some of the concepts used in this monograph, we provide the relevant mathematical background in Chapter 3, which also serves the purpose of introducing the notation. The first part is devoted to the definition of the relevant function spaces, with special emphasis on generalized functions (a.k.a. tempered distributions), which play a central role in our formulation. The second part reviews the classical, finite-dimensional tools of probability theory and shows how some of the concepts (e.g., characteristic function, Bochner's theorem) are extendable to the infinite-dimensional setting within the framework of Gelfand's theory of generalized stochastic processes [GV64].

Chapter 4 is devoted to the mathematical specification of the innovation model. Since the theory gravitates around the notion of Lévy exponents, we start with a systematic investigation of such functions, denoted by $f(\omega)$, which are fundamental to the (classical) study of infinitely-divisible probability laws. In particular, we discuss their canonical representation given by the Lévy-Khinchine formula. In Section 4.4, we make use of the powerful Minlos-Bochner theorem to transfer those representations to the infinite-dimensional setting. The fundamental result for our theory is that the class of admissible continuous-domain innovations for the model in Fig. 2.1 is constrained to the so-called family of white Lévy noises, each brand being uniquely characterized by a Lévy exponent $f(\omega)$. We conclude the chapter with the presentation of mathematical criteria for the existence of solutions of Lévy-driven SDEs (stochastic differential equations) and provide the functional tools for the complete statistical characterization of these processes. Interestingly, the classical Gaussian processes are covered by the formulation (by setting $f(\omega)=-\tfrac{1}{2}|\omega|^2$), but they turn out to be the only non-sparse members of the family.

Besides the random excitation $w$, the second fundamental component of the innovation model in Fig. 2.1 is the inverse $\mathrm{L}^{-1}$ of the whitening operator $\mathrm{L}$. It must fulfill some continuity/boundedness condition in order to yield a proper solution of the underlying SDE. The construction of such inverses (shaping filters) is the topic of Chapter 5, which presents a systematic catalog of the solutions that are currently available, including recent constructs for scale-invariant/unstable SDEs.

In Chapter 6, we review the tools that are available from the theory of splines in relation to the specification of the analysis kernels in Equations (2.1) and (2.3) above. Remarkably, the techniques are quite generic and applicable for any operator $\mathrm{L}$ that admits a proper inverse $\mathrm{L}^{-1}$. This is not too surprising, because we have taken advantage of our expert knowledge of splines to engineer/derive the solutions that are presented in Chapter 5. Indeed, by writing a generalized B-spline as $\beta_{\mathrm{L}}=\mathrm{L_d}\mathrm{L}^{-1}\delta$, one can appreciate that the construction of a B-spline for some operator $\mathrm{L}$ implicitly provides the solution of two innovation-related problems at once: (1) the formal inversion of the operator $\mathrm{L}$ (for solving the SDE), and (2) the proper discretization of $\mathrm{L}$ through a finite-difference scheme. The leading thread in our formulation is that these two tasks should not be dissociated—this is achieved formally via the identification of $\beta_{\mathrm{L}}$, which actually results in simplified and streamlined mathematics.

In Chapter 7, we apply our framework to the functional specification of a variety of generalized stochastic processes, including the classical family of Gaussian stationary processes and their sparse counterparts. We also characterize non-stationary processes that are solutions of unstable SDEs. In particular, we describe higher-order extensions of Lévy processes, as well as a whole variety of fractal-type processes.

In Chapter 8, we rely on our functional characterization to obtain a maximally-decoupled representation of sparse stochastic processes, by application of the discretized version of the whitening operator or by a suitable wavelet expansion.

While the transform-domain statistics can be worked out explicitly, our main point in Chapter 9 is to show that the sparsity pattern of the input noise is essentially preserved. Apart from a shaping effect that can be quantified, the resulting PDF remains within the same family of infinitely-divisible laws.

Chapter 10 is devoted to the application of the theory to the general problem of recovering signals from incomplete, noisy measurements, which is highly relevant to signal processing and biomedical imaging. To that end, we develop a general framework for the discretization of linear inverse problems using a suitable set of basis functions (e.g., B-splines or wavelets), which is analogous to the finite-element method for solving PDEs. The central element is the "projection" of the continuous-domain stochastic model of the signal onto the (finite-dimensional) reconstruction space in order to specify the prior statistical distribution of the signal. We then apply Bayes' rule to derive corresponding signal estimators (MAP or MMSE). We present examples of imaging applications, including wavelet-domain signal denoising, deconvolution, and the reconstruction of MRI data.

An introduction to Sparse Stochastic Processes Copyright © M. Unser and P.D. Tafti

Chapter 3

Mathematical context and background

In this chapter we summarize some of the mathematical preliminaries for the remaining chapters. These concern the function spaces used in the book, duality, generalized functions, probability theory, and generalized random processes. Each of these topics is discussed in a separate section.

For the most part, the theory of function spaces and generalized functions can be seen as an infinite-dimensional generalization of linear algebra (function spaces generalize R^n, and continuous linear operators generalize matrices). Similarly, the theory of generalized random processes involves the generalization of the idea of a finite random vector in R^n to an element of an infinite-dimensional space of generalized functions.

To give a taste of what is to come, we briefly compare the finite- and infinite-dimensional theories in Tables 3.1 and 3.2. The idea, in a nutshell, is to substitute vectors by (generalized) functions. Formally, this extension amounts to replacing some finite sums (in the finite-dimensional formulation) by integrals. Yet, in order for this to be mathematically sound, one needs to properly define the underlying objects as elements of some infinite-dimensional vector space and to specify the underlying notion(s) of convergence (which is not an issue in R^n), while ensuring that some basic continuity conditions are met.

The impatient reader who is not directly concerned by those mathematical issues may skip what follows the tables at first reading and consult these sections later as the need arises. Yet, such a reader should be warned that the material on infinite-dimensional probability theory from Subsection 3.4.4 to the end of the chapter is fundamental to our formulation. The mastery of those notions also requires a good understanding of function spaces and generalized functions, which are covered in the first part of the chapter.

3.1 Some classes of function spaces

By the term function we shall intend elements of various function spaces. At a minimum, a function space is a set X along with some criteria for determining, first, whether or not a given “function” φ = φ(r) belongs to X (in mathematical notation, φ ∈ X) and, second, given φ, φ′ ∈ X, whether or not φ and φ′ describe the same object in X (in mathematical notation, φ = φ′). Most often, in addition to these, the space X has additional structure (see below).

In this book we shall largely deal with two types of function spaces: complete normed spaces such as the Lebesgue L_p spaces, and nuclear spaces such as the Schwartz space S and the space D of compactly supported test functions, as well as their duals S′ and D′, which are spaces of generalized functions. These two categories of spaces (complete-normed

finite-dimensional theory (linear algebra) ↔ infinite-dimensional theory (functional analysis)

Spaces: Euclidean space R^N, complexification C^N ↔ function spaces such as the Lebesgue space L_p(R^d) and the space of tempered distributions S′(R^d), among others.

Elements: vector x = (x_1, ..., x_N) in R^N or C^N ↔ function f(r) in S′(R^d), L_p(R^d), etc.

Bilinear scalar product: ⟨x, y⟩ = ∑_{n=1}^{N} x_n y_n ↔ ⟨φ, g⟩ = ∫ φ(r) g(r) dr, with φ ∈ S(R^d) (test function) and g ∈ S′(R^d) (generalized function), or φ ∈ L_p(R^d) and g ∈ L_q(R^d) with 1/p + 1/q = 1, for instance.

Equality: x = y ⟺ x_n = y_n for all n ⟺ ⟨u, x⟩ = ⟨u, y⟩ for all u ∈ R^N ⟺ ‖x − y‖_2 = 0 ↔ various notions of equality (depending on the space), such as weak equality of distributions, f = g ∈ S′(R^d) ⟺ ⟨φ, f⟩ = ⟨φ, g⟩ for all φ ∈ S(R^d), and almost-everywhere equality, f = g ∈ L_p(R^d) ⟺ ∫_{R^d} |f(r) − g(r)|^p dr = 0.

Operators: linear operators R^N → R^M, y = Ax ⟹ y_m = ∑_{n=1}^{N} a_{mn} x_n ↔ continuous linear operators S(R^d) → S′(R^d), g = Aφ ⟹ g(r) = ∫_{R^d} a(r, s) φ(s) ds for some a ∈ S′(R^d × R^d) (Schwartz' kernel theorem).

Transpose: ⟨x, Ay⟩ = ⟨A^T x, y⟩ ↔ adjoint: ⟨φ, Ag⟩ = ⟨A*φ, g⟩.

Table 3.1: Comparison of notions of linear algebra with those of functional analysis and the theory of distributions (generalized functions). See Sections 3.1–3.3 for an explanation.

and nuclear) cannot overlap, except in finite dimensions. Since the function spaces that are of interest to us are infinite-dimensional (they do not have a finite vector-space basis), the two categories are mutually exclusive.

The structure of each of the aforementioned spaces has two aspects. First, as a vector space over the real numbers or its complexification, the space has an algebraic structure. Second, with regard to the notions of convergence and taking of limits, the space has a topological structure. The algebraic structure lends meaning to the idea of a linear operator on the space, while the topological structure gives rise to the concept of a continuous operator or map, as we shall see shortly.

All the spaces considered here have a similar algebraic structure. They are either vector spaces over R, meaning that for any φ, φ′ in the space and any a ∈ R, the operations of addition φ ↦ φ + φ′ and multiplication by scalars φ ↦ aφ are defined and map the space (denoted henceforth by X) into itself. Or, we may take the complexification of a real vector space

finite-dimensional ↔ infinite-dimensional

random variable X in R^N ↔ generalized stochastic process s in S′

probability measure P_X on R^N ↔ probability measure P_s on S′

P_X(E) = Prob(X ∈ E) = ∫_E p_X(x) dx (p_X is a generalized [i.e., hybrid] pdf), for suitable subsets E ⊂ R^N ↔ P_s(E) = Prob(s ∈ E) = ∫_E P_s(dg), for suitable subsets E ⊂ S′

characteristic function P̂_X(ξ) = E{e^{j⟨ξ,X⟩}} = ∫_{R^N} e^{j⟨ξ,x⟩} p_X(x) dx, ξ ∈ R^N ↔ characteristic functional P̂_s(φ) = E{e^{j⟨φ,s⟩}} = ∫_{S′} e^{j⟨φ,g⟩} P_s(dg), φ ∈ S

Table 3.2: Comparison of notions of finite-dimensional statistical calculus with the theory of generalized stochastic processes. See Section 3.4 for an explanation.

X, composed of elements of the form φ = φ_r + jφ_i with φ_r, φ_i ∈ X and j denoting the imaginary unit. The complexification is then a vector space over C. In the remainder of the book, we shall denote a real vector space and its complexification by the same symbol. The distinction, when important, will be clear from the context.

For the spaces with which we are concerned in this book, the topological structure is completely specified by providing a criterion for the convergence of sequences.¹ By this we mean that, for any given sequence (φ_i) in X and any φ ∈ X, we are equipped with the knowledge of whether or not φ is the limit of (φ_i). A topological space is a set X with topological structure. For normed spaces, the said criterion is given in terms of a norm, while in nuclear spaces it is given in terms of a family of seminorms, as we shall discuss below. But before that, let us first define linear and continuous operators.

An operator is a mapping from one vector space to another; that is, a rule that associates an output function Aφ ∈ Y to each input φ ∈ X.

Definition 1 (Linear operator). An operator A : X → Y, where X and Y are vector spaces, is called linear if, for any φ, φ′ ∈ X and a, b ∈ R (or C),

A(aφ + bφ′) = aAφ + bAφ′. (3.1)

Definition 2 (Continuous operator). Let X, Y be topological spaces. An operator A : X → Y is called sequentially continuous (with respect to the topologies of X and Y) if, for any convergent sequence (φ_i) in X with limit φ ∈ X, the sequence (Aφ_i) converges to Aφ in Y; that is,

lim_i Aφ_i = A(lim_i φ_i).

The above definition of continuity coincides with the stricter topological definition for the spaces we are interested in.

We shall assume that the topological structure of our vector spaces is such that the operations of addition and multiplication by scalars in R (or C) are continuous. With these compatibility conditions, our object is called a topological vector space.

1. This is in contrast with those topological spaces where one needs to consider generalizations of the notion of a sequence involving partially ordered sets (the so-called nets or filters). Spaces in which a knowledge of sequences suffices are called sequential.

Having defined the two types of structure (algebraic and topological) and their relation with operators in abstract terms, let us now show concretely how the topological structure is defined for some important classes of spaces.

3.1.1 About the notation: mathematics vs. engineering

So far, we have considered a function in abstract terms as an element of a vector space: φ ∈ X. The more conventional view is that of a map φ : R^d → R (or C) that associates a value φ(r) to each point r = (r_1, ..., r_d) ∈ R^d. Following the standard convention in engineering, we shall therefore also use the notation φ(r) [instead of φ(·) or φ] to represent the function, using r as our generic d-dimensional index variable, the norm of which is given by |r|² = ∑_{i=1}^{d} |r_i|². This is to be contrasted with the point values (or samples) of φ, which will be denoted using subscripted index variables; i.e., φ(r_k) stands for the value of φ at r = r_k. Likewise, φ(r − r_0) = φ(· − r_0) refers to the function φ shifted by r_0.

A word of caution is in order here. While the engineering notation has the advantage of being explicit, it can also be felt as being abusive because the point values of φ are not necessarily well defined, especially when the function presents discontinuities, not to mention the case of generalized functions that do not have a pointwise interpretation. φ(r) should therefore be treated as an alternative notation for φ that reminds us of the domain of the function, and not be interpreted literally.

3.1.2 Normed spaces

A norm on X is a map X → R, usually denoted by φ ↦ ‖φ‖ (with indices used if needed to distinguish between different norms), which fulfils the following properties for all a ∈ R (or C) and φ, φ′ ∈ X:

‖φ‖ ≥ 0 (nonnegativity);

‖aφ‖ = |a| ‖φ‖ (positive homogeneity);

‖φ + φ′‖ ≤ ‖φ‖ + ‖φ′‖ (triangle inequality);

‖φ‖ = 0 implies φ = 0 (separation of points).

By relaxing the last requirement, we obtain a seminorm.

A normed space is a vector space X equipped with a norm.

A sequence (φ_i) in a normed space X is said to converge to φ (in the topology of X), in symbols

lim_i φ_i = φ,

if and only if

lim_i ‖φ − φ_i‖ = 0.

Let (φ_i) be a sequence in X such that, for any ε > 0, there exists an N ∈ N with

‖φ_i − φ_j‖ < ε for all i, j ≥ N.

Such a sequence is called a Cauchy sequence. A normed space X is complete if it does not have any holes, in the sense that, for every Cauchy sequence in X, there exists a φ ∈ X such that lim_i φ_i = φ (in other words, if every Cauchy sequence has a limit in X). A normed space that is not complete can be completed by introducing new points corresponding to the limits of equivalent Cauchy sequences. For example, the real line is the completion of the set of rational numbers with respect to the absolute-value norm.
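As a small numerical aside (ours, not the book's), the Babylonian iteration x_{i+1} = (x_i + 2/x_i)/2 produces a Cauchy sequence of rational numbers whose limit, √2, lies outside Q; completing Q with respect to the absolute-value norm is precisely what supplies the missing limit:

```python
from fractions import Fraction

def babylonian(n_terms):
    """Rational iterates x_{i+1} = (x_i + 2/x_i)/2, a Cauchy sequence in Q
    whose limit sqrt(2) is NOT rational."""
    x = Fraction(1)
    seq = [x]
    for _ in range(n_terms - 1):
        x = (x + 2 / x) / 2
        seq.append(x)
    return seq

seq = babylonian(8)

# Cauchy criterion: tail terms cluster within any eps > 0
eps = Fraction(1, 10**6)
tail_is_cauchy = all(abs(seq[i] - seq[j]) < eps
                     for i in range(5, 8) for j in range(5, 8))

# The sequence nevertheless approaches sqrt(2), which has no place in Q
error = abs(float(seq[-1]) - 2**0.5)
```

The exact rational arithmetic of `Fraction` makes the point sharply: every term is in Q, the Cauchy criterion holds, yet the limit is not.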

Examples

Important examples of complete normed spaces are the Lebesgue spaces. The Lebesgue space L_p(R^d), 1 ≤ p ≤ ∞, is composed of functions whose L_p(R^d) norm, denoted by ‖·‖_p, is finite, where

‖φ‖_p := (∫_{R^d} |φ(r)|^p dr)^{1/p} for 1 ≤ p < ∞,
‖φ‖_∞ := ess sup_{r∈R^d} |φ(r)| for p = ∞,

and where two functions that are equal almost everywhere are considered to be equivalent.

We may also define weighted L_p spaces by replacing the shift-invariant Lebesgue measure dr by a weighted measure w(r) dr in the above definitions. In that case, w(r) is assumed to be a measurable function that is (strictly) positive almost everywhere. In particular, for w(r) = 1 + |r|^α with α > 0, we denote the associated norms by ‖·‖_{p,α} and the corresponding normed spaces by L_{p,α}(R^d). The latter spaces are useful for characterizing the decay of functions at infinity. For example, L_{∞,α}(R^d) is the space of functions that are bounded almost everywhere by a constant multiple of 1/(1 + |r|^α).
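To make the definitions concrete, here is a small numerical sketch (our own illustration, with φ(r) = e^{−|r|} in d = 1) approximating both ‖φ‖_p and the weighted norm ‖φ‖_{p,α} by Riemann sums; the exact value ‖φ‖_2 = 1 serves as a check:

```python
import numpy as np

def lp_norm(phi_vals, r, p, alpha=None):
    """Riemann-sum approximation of the Lp norm on a uniform 1-D grid.
    If alpha is given, the measure dr is replaced by (1 + |r|^alpha) dr,
    which mirrors the weighted space L_{p,alpha}."""
    w = np.ones_like(r) if alpha is None else 1.0 + np.abs(r)**alpha
    dr = r[1] - r[0]
    return (np.sum(np.abs(phi_vals)**p * w) * dr)**(1.0 / p)

r = np.linspace(-30.0, 30.0, 600001)
phi = np.exp(-np.abs(r))

n2 = lp_norm(phi, r, 2)            # exact value: (∫ e^{-2|r|} dr)^{1/2} = 1
n2w = lp_norm(phi, r, 2, alpha=2)  # weighting by (1 + r^2) can only increase it
```

With w(r) = 1 + r², the exact weighted norm is √(3/2), so the weight indeed penalizes slow decay at infinity.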

Some remarkable inclusion properties of L_{p,α}(R^d), 1 ≤ p ≤ ∞, α > 0, are the following:

α > α′ implies L_{p,α}(R^d) ⊂ L_{p,α′}(R^d);

L_{∞, d/p + ε}(R^d) ⊂ L_p(R^d) for any ε > 0.

Finally, we define the space of rapidly decaying functions, R(R^d), as the intersection of all L_{∞,α}(R^d) spaces, α > 0, or, equivalently, as the intersection of all L_{∞,α}(R^d) with α ∈ N. In other words, R(R^d) contains all bounded functions that essentially decay faster than 1/|r|^α at infinity for all α ∈ R^+. A sequence (f_i) converges in (the topology of) R(R^d) if and only if it converges in all L_{∞,α}(R^d) spaces.

The causal exponential ρ_α(r) = 1_{[0,∞)}(r) e^{αr} with Re(α) < 0, which is central to linear-systems theory, is a prototypical example of a function included in R(R).
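As a quick numerical check (ours), the causal exponential with α = −1 indeed belongs to every L_{∞,α′}(R): the weighted sup-norms sup_r (1 + |r|^{α′}) |ρ(r)| remain finite for arbitrarily large polynomial weights α′, because exponential decay dominates polynomial growth:

```python
import numpy as np

def causal_exp(r, a=-1.0):
    """rho_a(r) = 1_{[0,inf)}(r) e^{a r} with Re(a) < 0."""
    return np.where(r >= 0, np.exp(a * r), 0.0)

r = np.linspace(-5.0, 200.0, 400001)
rho = causal_exp(r)

# Weighted sup-norms ||rho||_{inf,k} = sup_r (1 + |r|^k) |rho(r)| for growing k;
# each is finite (the maximizer drifts to larger r but the value never blows up)
weighted_sups = [float(np.max((1.0 + np.abs(r)**k) * rho)) for k in (1, 5, 20)]
```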

3.1.3 Nuclear spaces

Defining nuclear spaces is neither easy nor particularly intuitive. Fortunately, for our purpose in this book, knowing the definition is not necessary. We shall simply assert that certain function spaces are nuclear, in order to use certain results that are true for nuclear spaces (specifically, the Minlos-Bochner theorem; see below). For the sake of completeness, a general definition of nuclear spaces is given at the end of this section, but this definition may safely be skipped without compromising the presentation.

Specifically, it will be sufficient for us to know that the spaces D(R^d) and S(R^d), which we shall shortly define, are nuclear, as are the Cartesian products and powers of nuclear spaces, and their closed subspaces.

To define these spaces, we need to identify their members, as well as the criterion of convergence for sequences in the space.

The space D(R^d)

The space of compactly supported smooth test functions is denoted by D(R^d). It consists of infinitely differentiable functions with compact support in R^d. To define its topology, we provide the following criterion for convergence in D(R^d):

A sequence (φ_i) of functions in D(R^d) is said to converge (in the topology of D(R^d)) if

1. there exists a compact (here, meaning closed and bounded) subset K of R^d such that all φ_i are supported inside K;

2. the sequence (φ_i) converges in all of the seminorms

‖φ‖_{K,n} := sup_{r∈K} |∂^n φ(r)| for all n ∈ N^d.

Here, n = (n_1, ..., n_d) ∈ N^d is what is called a multi-index, and ∂^n is shorthand for the partial derivative ∂_{r_1}^{n_1} ··· ∂_{r_d}^{n_d}. We take advantage of the present opportunity to also introduce two other notations: |n| for ∑_{i=1}^{d} |n_i| and r^n for the product ∏_{i=1}^{d} r_i^{n_i}.
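For illustration (our own sketch), the standard bump function φ(r) = exp(−1/(1 − r²)) for |r| < 1 (and 0 elsewhere) is a prototypical element of D(R): it is infinitely differentiable, supported in the compact set K = [−1, 1], and its first seminorms ‖φ‖_{K,n} can be estimated by finite differences:

```python
import numpy as np

def bump(r):
    """Smooth test function supported in [-1, 1]: exp(-1/(1-r^2)) inside, 0 outside."""
    out = np.zeros_like(r)
    inside = np.abs(r) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - r[inside]**2))
    return out

r = np.linspace(-2.0, 2.0, 80001)
dr = r[1] - r[0]
phi = bump(r)

# Compact support: phi vanishes identically outside K = [-1, 1]
vanishes_outside = bool(np.all(phi[np.abs(r) >= 1.0] == 0.0))

# Seminorm estimates ||phi||_{K,0} and ||phi||_{K,1}
s0 = float(np.max(np.abs(phi)))                    # sup |phi|, attained at r = 0: e^{-1}
s1 = float(np.max(np.abs(np.gradient(phi, dr))))   # sup |phi'|, also finite
```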

The space D(R^d) is nuclear (for a proof, see for instance [GV64]).

The Schwartz space S(R^d)

The Schwartz space, or space of so-called smooth and rapidly decaying test functions, denoted by S(R^d), consists of infinitely differentiable functions φ on R^d for which all of the seminorms defined below are finite:

‖φ‖_{m,n} = sup_{r∈R^d} |r^m ∂^n φ(r)| for all m, n ∈ N^d.

In other words, S(R^d) is populated by functions that, together with all of their derivatives, decay faster than the inverse of any polynomial at infinity.

The topology of S(R^d) is defined by positing that a sequence (φ_i) converges in S(R^d) if and only if it converges in all of the above seminorms.

The Schwartz space has the remarkable property that its complexification is invariant under the Fourier transform. In other words, the Fourier transform, defined by the integral

φ(r) ↦ φ̂(ω) = F{φ}(ω) := ∫_{R^d} e^{−j⟨r,ω⟩} φ(r) dr,

and inverted by

φ̂(ω) ↦ φ(r) = F^{−1}{φ̂}(r) := ∫_{R^d} e^{j⟨r,ω⟩} φ̂(ω) dω/(2π)^d,

is a continuous map from S(R^d) into itself. Our convention here is to use ω ∈ R^d as the generic Fourier-domain index variable.
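The sign and normalization conventions can be probed numerically (our sketch): the Gaussian φ(r) = e^{−r²/2} belongs to S(R), and under the above definition its transform is φ̂(ω) = √(2π) e^{−ω²/2}:

```python
import numpy as np

r = np.linspace(-20.0, 20.0, 40001)
dr = r[1] - r[0]

def fourier(f_vals, omega):
    """F{f}(w) = ∫ e^{-j r w} f(r) dr, approximated by a Riemann sum on the r-grid."""
    return np.array([np.sum(f_vals * np.exp(-1j * r * w)) * dr for w in omega])

phi = np.exp(-r**2 / 2)
omega = np.linspace(-5.0, 5.0, 201)
phi_hat = fourier(phi, omega)

# Known closed form under this convention: sqrt(2π) e^{-ω²/2}
expected = np.sqrt(2 * np.pi) * np.exp(-omega**2 / 2)
max_err = float(np.max(np.abs(phi_hat - expected)))
```

Because the Gaussian decays so fast, the plain Riemann sum is essentially exact here; note also that φ̂ is again a Gaussian, consistent with S being mapped into itself.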

In addition, both S(R^d) and D(R^d) are closed and continuous under differentiation of any order and multiplication by polynomials. Lastly, they are included in R(R^d) and hence in all the Lebesgue spaces L_p(R^d), which do not require any smoothness.

General definition of nuclear spaces*

Defining a nuclear space requires us to define nuclear operators. These are operators that can be approximated by operators of finite rank in a certain sense (an operator between vector spaces is of finite rank if its range is finite-dimensional).

We first recall the notation ℓ_p(N), 1 ≤ p < ∞, for the space of p-summable sequences; that is, sequences c = (c_i)_{i∈N} for which

∑_{i∈N} |c_i|^p

is finite. We also denote by ℓ_∞(N) the space of all bounded sequences.

In a complete normed space Y, let (ψ_i)_{i∈N} be a sequence with bounded norm (i.e., ‖ψ_i‖ ≤ M for some M ∈ R and all i ∈ N). We then denote by M_ψ the linear operator ℓ_1(N) → Y which maps a sequence c = (c_i)_{i∈N} in ℓ_1 to the weighted sum

∑_{i∈N} c_i ψ_i

in Y (the sum converges in norm by the triangle inequality).

An operator A : X → Y, where X, Y are complete normed spaces, is called nuclear if there exist a continuous linear operator

Ã : X → ℓ_∞ : φ ↦ (a_i(φ)),

an operator

Λ : ℓ_∞ → ℓ_1 : (c_i) ↦ (λ_i c_i)

with ∑_i |λ_i| < ∞, and a bounded sequence ψ = (ψ_i) in Y, such that we can write

A = M_ψ Λ Ã.

This is equivalent to the following decomposition of A into a sum of rank-1 operators:

A : φ ↦ ∑_{i∈N} λ_i a_i(φ) ψ_i.

The continuous linear operator X → Y : φ ↦ λ_i a_i(φ) ψ_i is of rank 1 because it maps X into the one-dimensional subspace of Y spanned by ψ_i; compare (ψ_i) with a basis and (a_i(φ)) with the coefficients of Aφ in this basis.

More generally, given an arbitrary topological vector space X and a complete normed space Y, the operator A : X → Y is said to be nuclear if there exist a complete normed space X_1, a nuclear operator A_1 : X_1 → Y, and a continuous operator B : X → X_1, such that

A = A_1 B.

Finally, X is a nuclear space if any continuous linear map X → Y, where Y is a complete normed space, is nuclear.

3.2 Dual spaces and adjoint operators

Given a space X, a functional on X is a map f that takes X to the scalar field R (or C). In other words, f takes a function φ ∈ X as its argument and returns the number f(φ).

When X is a vector space, we may consider linear functionals on it, where linearity has the same meaning as in Definition 1. When f is a linear functional, it is customary to use the notation ⟨φ, f⟩ in place of f(φ).

The set of all linear functionals on X, denoted by X*, can be given the structure of a vector space in the obvious way by the identity

⟨φ, a f + b f_0⟩ = a⟨φ, f⟩ + b⟨φ, f_0⟩,

where φ ∈ X, f, f_0 ∈ X*, and a, b ∈ R (or C) are arbitrary. The resulting vector space X* is called the algebraic dual of X.

The map from X × X* to R (or C) that takes the pair (φ, f) to their so-called scalar product ⟨φ, f⟩ is then bilinear, in the sense that it is linear in each of the arguments φ and f.

Given vector spaces X, Y with algebraic duals X*, Y* and a linear operator A : X → Y, the adjoint or transpose of A, denoted by A*, is the linear operator Y* → X* defined by

A* f = f ∘ A

for any linear functional f : Y → R (or C) in Y*, where ∘ denotes composition. The motivation behind the above definition is to have the identity

⟨Aφ, f⟩ = ⟨φ, A* f⟩ (3.2)

hold for all φ ∈ X and f ∈ Y*.

If X is a topological vector space, it is of interest to consider the subspace of X* composed of those linear functionals on X that are continuous with respect to the topology of X. This subspace is denoted by X′ and called the topological or continuous dual of X. Note that, unlike X*, the continuous dual generally depends on the topology of X. In other words, the same vector space X with different topologies will generally have different continuous duals.

As a general rule, in this book we shall adopt some standard topologies and only work with the corresponding continuous dual space, which we shall simply call the dual. Also, henceforth, we shall assume the scalar product ⟨·, ·⟩ to be restricted to X × X′. There, the space X may vary but is necessarily paired with its continuous dual.

Following the restrictions of the previous paragraph, we sometimes say that the adjoint of A : X → Y exists, to mean that the algebraic adjoint A* : Y* → X*, when restricted to Y′, maps into X′, so that we can write

⟨Aφ, f⟩ = ⟨φ, A* f⟩,

where the scalar products on the two sides are now restricted to Y × Y′ and X × X′, respectively.

One can define different topologies on X′ by providing various criteria for convergence. The only one we shall need to deal with is the weak-* topology, which indicates (for a sequential space X) that (f_i) converges to f in X′ if and only if

lim_i ⟨φ, f_i⟩ = ⟨φ, f⟩ for all φ ∈ X.

This is precisely the topology of pointwise convergence for all “points” φ ∈ X.

We shall now mention some examples.

3.2.1 The dual of L_p spaces

The dual of the Lebesgue space L_p(R^d), 1 ≤ p < ∞, can be identified with the space L_{p′}(R^d), 1 < p′ ≤ ∞, satisfying 1/p + 1/p′ = 1, by defining

⟨φ, f⟩ = ∫_{R^d} φ(r) f(r) dr (3.3)

for φ ∈ L_p(R^d) and f ∈ L_{p′}(R^d). In particular, L_2(R^d), which is the only Hilbert space of the family, is its own dual.

To see that the linear functionals described by the above formula with f ∈ L_{p′} are continuous on L_p, we can rely on Hölder's inequality, which states that

|⟨φ, f⟩| ≤ ∫_{R^d} |φ(r) f(r)| dr ≤ ‖φ‖_p ‖f‖_{p′}

for 1 ≤ p, p′ ≤ ∞ with 1/p + 1/p′ = 1. The special case of this inequality for p = 2 yields the Cauchy-Schwarz inequality.
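Hölder's inequality is easy to check on discretized functions (a sketch of ours, with the conjugate pair p = 3, p′ = 3/2):

```python
import numpy as np

r = np.linspace(-10.0, 10.0, 20001)
dr = r[1] - r[0]

phi = np.exp(-r**2) * np.cos(3 * r)   # plays the role of φ ∈ L_p
f = 1.0 / (1.0 + r**2)                # plays the role of f ∈ L_{p'}

def lp_norm(g, p):
    """Riemann-sum Lp norm on the grid."""
    return float((np.sum(np.abs(g)**p) * dr)**(1.0 / p))

p, p_conj = 3.0, 1.5                  # conjugate exponents: 1/3 + 2/3 = 1
lhs = float(np.sum(np.abs(phi * f)) * dr)     # ∫ |φ(r) f(r)| dr
rhs = lp_norm(phi, p) * lp_norm(f, p_conj)    # ||φ||_p ||f||_{p'}
```

The discrete (weighted-sum) version of Hölder's inequality holds exactly, so the bound is respected on any grid, not just in the limit.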

3.2.2 The duals of D and S

In this subsection, we give the mathematical definition of the duals of the nuclear spaces D and S. A physical interpretation of these definitions is postponed until the next section.

The dual of D(R^d), denoted by D′(R^d), is the so-named space of distributions over R^d (although we shall use the term distribution more generally, to mean any generalized function in the sense of the next section). Ordinary locally integrable functions² (in particular, all L_p functions and all continuous functions) can be identified with elements of D′(R^d) by using (3.3). By this, we mean that any locally integrable function f defines a continuous linear functional on D(R^d) where, for φ ∈ D(R^d), ⟨φ, f⟩ is given by (3.3). However, not all elements of D′(R^d) can be characterized in this way. For instance, the Dirac functional δ, which maps φ ∈ D(R^d) to the value ⟨φ, δ⟩ = φ(0), belongs in D′(R^d) but cannot be written as an integral à la (3.3). Even in this and similar cases, we may sometimes write ∫_{R^d} φ(r) f(r) dr, keeping in mind that the integral is no longer a true (i.e., Lebesgue) integral, but simply an alternative notation for ⟨φ, f⟩.

In similar fashion, the dual of S(R^d), denoted by S′(R^d), is defined and called the space of tempered (or Schwartz) distributions. Since D ⊂ S and any sequence that converges in the topology of D also converges in S, it follows that S′(R^d) is (can be identified with) a smaller space (i.e., a subspace) of D′(R^d). In particular, not every locally integrable function belongs in S′. For example, locally integrable functions of exponential growth have no place in S′, as their scalar products with Schwartz test functions via (3.3) are not in general finite (much less continuous). Once again, S′(R^d) contains objects that are not functions on R^d in the true sense of the word. For example, δ also belongs in S′(R^d).
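The Dirac functional can be visualized (our sketch) as the limit of normalized Gaussian kernels g_ε: the pairings ⟨φ, g_ε⟩ computed via (3.3) approach φ(0) as ε shrinks, even though the limit δ is no longer representable by that integral:

```python
import numpy as np

r = np.linspace(-10.0, 10.0, 400001)
dr = r[1] - r[0]
phi = np.cos(r) * np.exp(-r**2 / 8)   # smooth, decaying test function; phi(0) = 1

def pairing(g_vals):
    """<phi, g> = ∫ phi(r) g(r) dr as a Riemann sum."""
    return float(np.sum(phi * g_vals) * dr)

# Normalized Gaussians of shrinking width concentrate the pairing at r = 0
vals = []
for eps in (1.0, 0.1, 0.01):
    g = np.exp(-r**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))
    vals.append(pairing(g))
```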

3.2.3 Distinction between Hermitian and duality products

We use the notation ⟨f, g⟩_{L_2} = ∫_{R^d} f(r) g̅(r) dr to represent the usual (Hermitian-symmetric) L_2 inner product, where the bar denotes complex conjugation. The latter is defined for f, g ∈ L_2(R^d) (the Hilbert space of complex finite-energy functions); it is equivalent to Schwartz' duality product only when the second argument is real-valued (due to the presence of complex conjugation). The corresponding Hermitian adjoint of an operator A is denoted by A^H. It is defined via ⟨A^H f, g⟩_{L_2} = ⟨f, A g⟩_{L_2}, which implies that A^H coincides with the duality adjoint A* up to complex conjugation. The distinction between the two types of adjoints is only relevant when considering signal expansions or analyses in terms of complex basis functions.

The classical Fourier transform is defined as

f̂(ω) = F{f}(ω) := ∫_{R^d} f(r) e^{−j⟨r,ω⟩} dr

for any f ∈ L_1(R^d). This definition admits a unique extension, F : L_2(R^d) → L_2(R^d), which is an isometry map (Plancherel's theorem). The fact that the Fourier transform preserves the L_2 norm of a function (up to a normalization factor) is a direct consequence of Parseval's relation

⟨f, g⟩_{L_2} = (1/(2π)^d) ⟨f̂, ĝ⟩_{L_2},

whose duality-product equivalent is ⟨f̂, g⟩ = ⟨f, ĝ⟩.

2. A function on R^d is called locally integrable if its integral over any closed bounded set is finite.

3.3 Generalized functions

3.3.1 Intuition and definition

We begin with some considerations regarding the modeling of physical phenomena. Let us suppose that the object of our study is some physical quantity f that varies in relation to some parameter r ∈ R^d representing space and/or time. We assume that our way of obtaining information about f is by making measurements that are localized in space-time using sensors (φ, ψ, ...). We shall denote the measurement of f procured by φ as ⟨φ, f⟩.³ Let us suppose that our sensors form a vector space, in the sense that, for any two sensors φ, ψ and any two scalars a, b ∈ R (or C), there is a real or virtual sensor aφ + bψ such that

⟨aφ + bψ, f⟩ = a⟨φ, f⟩ + b⟨ψ, f⟩.

In addition, we may reasonably suppose that the phenomenon under observation has some form of continuity, meaning that

lim_i ⟨φ_i, f⟩ = ⟨φ, f⟩,

where (φ_i) is a sequence of sensors that tends to φ in a certain sense. We denote the set of all sensors by X. In the light of the above notions of linear combinations and limits defined in X, the space of sensors then mathematically has the structure of a topological vector space.

Given the above properties and the definitions of the previous sections, we conclude that f represents an element of the continuous dual X′ of X. Given that our sensors, as previously noted, are assumed to be localized in R^d, we may model them as compactly supported or rapidly decaying functions on R^d, denoted by the same symbols (φ, ψ, ...) and, in the case where f also corresponds to a function on R^d, relate the observation ⟨φ, f⟩ to the functional form of φ and f by the identity

⟨φ, f⟩ = ∫_{R^d} φ(r) f(r) dr.

We exclude from consideration those functions f for which the above integral is undefined or infinite for some φ ∈ X.

However, we are not limited to taking f to be a true function of r ∈ R^d. By requiring our sensor or test functions to be smooth, we can permit f to become singular; that is, to depend on the value of φ and/or of its derivatives at isolated points/curves inside R^d. An example of a singular generalized function f, which we have already noted, is the Dirac distribution δ, which measures the value of φ at the single point r = 0 (i.e., ⟨φ, δ⟩ = φ(0)).

Mathematically, we define generalized functions as members of the continuous dual X′ of a nuclear space X of functions, such as D(R^d) or S(R^d).

3.3.2 Operations on generalized functions

Following (3.2), any continuous linear operator D → D or S → S can be transposed to define a continuous linear operator D′ → D′ or S′ → S′. In particular, since D(R^d) and S(R^d) are closed under differentiation, we can define derivatives of distributions.

First, note that, formally,

⟨∂^n φ, f⟩ = ⟨φ, ∂^{n*} f⟩.

3. The connection with previous sections should already be apparent from this choice of notation.

Now, using integration by parts in (3.3), for φ, f in D(R^d) or S(R^d), we see that ∂^{n*} = (−1)^{|n|} ∂^n. In other words, we can write

⟨φ, ∂^n f⟩ = (−1)^{|n|} ⟨∂^n φ, f⟩. (3.4)

The idea is then to use (3.4) as the defining formula in order to extend the action of the derivative operator ∂^n to any f ∈ D′(R^d) or S′(R^d).
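Identity (3.4) can be verified numerically when f happens to be an ordinary smooth function (our sketch, with n = 1 in one dimension):

```python
import numpy as np

r = np.linspace(-10.0, 10.0, 200001)
dr = r[1] - r[0]

phi = np.exp(-r**2 / 2)              # test function in S(R)
f = np.exp(-r**2) * np.sin(2 * r)    # smooth, decaying "distribution"

dphi = np.gradient(phi, dr)          # ∂φ (second-order finite differences)
df = np.gradient(f, dr)              # ∂f

lhs = float(np.sum(phi * df) * dr)   # <φ, ∂f>
rhs = -float(np.sum(dphi * f) * dr)  # (−1)^{|n|} <∂φ, f> with |n| = 1
```

The two pairings agree up to discretization error; for a genuine distribution (e.g., δ), the right-hand side is the only one that makes sense, which is exactly why (3.4) is taken as the definition.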

Formulas for scaling, shifting (translation), rotation, and other geometric transformations of distributions are obtained in a similar manner. For instance, the translation by r_0 of a generalized function f is defined via the identity

⟨φ, f(· − r_0)⟩ = ⟨φ(· + r_0), f⟩.

More generally, we give the following definition.

Definition 3. Given operators U,U§ : S (Rd ) ! S (Rd ) that form an adjoint pair onS (Rd )£S (Rd ), we extend their action to S 0(Rd ) ! S 0(Rd ) by defining U f and U§ f soas to have

$\langle \varphi, U f \rangle = \langle U^* \varphi, f \rangle, \qquad \langle \varphi, U^* f \rangle = \langle U \varphi, f \rangle,$

for all $f$. A similar definition gives the extension of adjoint pairs $\mathcal{D}(\mathbb{R}^d) \to \mathcal{D}(\mathbb{R}^d)$ to operators $\mathcal{D}'(\mathbb{R}^d) \to \mathcal{D}'(\mathbb{R}^d)$.

Examples of operators $\mathcal{S}(\mathbb{R}^d) \to \mathcal{S}(\mathbb{R}^d)$ that can be extended in the above fashion include derivatives, rotations, scaling, translation, time-reversal, and multiplication by smooth functions of slow growth in the space-time domain. The other fundamental operation is the Fourier transform, which is treated in the next section.

3.3.3 The Fourier transform of generalized functions

We have already noted that the Fourier transform $\mathcal{F}$ is a reversible operator that maps the (complexified) space $\mathcal{S}(\mathbb{R}^d)$ into itself. The additional relevant property is that $\mathcal{F}$ is self-adjoint: $\langle \varphi, \mathcal{F}\psi \rangle = \langle \mathcal{F}\varphi, \psi \rangle$ for all $\varphi, \psi \in \mathcal{S}(\mathbb{R}^d)$. This helps us specify the generalized Fourier transform of distributions in accordance with the general extension principle in Definition 3.

Definition 4. The generalized Fourier transform of a distribution $f \in \mathcal{S}'(\mathbb{R}^d)$ is the distribution $\hat{f} = \mathcal{F}\{f\} \in \mathcal{S}'(\mathbb{R}^d)$ that satisfies

$\langle \varphi, \hat{f} \rangle = \langle \hat{\varphi}, f \rangle$

for all $\varphi \in \mathcal{S}$, where $\hat{\varphi} = \mathcal{F}\{\varphi\}$ is the classical Fourier transform of $\varphi$ given by the integral

$\hat{\varphi}(\omega) = \int_{\mathbb{R}^d} e^{-j \langle r, \omega \rangle} \varphi(r) \, \mathrm{d}r.$

For example, since we have

$\int_{\mathbb{R}^d} \varphi(r) \, \mathrm{d}r = \langle \varphi, 1 \rangle = \hat{\varphi}(0) = \langle \hat{\varphi}, \delta \rangle,$

we conclude that the (generalized) Fourier transform of $\delta$ is the constant function 1.
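This identity has a simple finite-dimensional counterpart that can be checked numerically, with the DFT standing in for $\mathcal{F}$ (an illustrative sketch, assuming NumPy is available; it is not part of the text's development):

```python
import numpy as np

# Discrete sanity check: the DFT of a discrete unit impulse is the
# all-ones sequence, mirroring the continuous-domain fact F{delta} = 1.
n = 64
delta = np.zeros(n)
delta[0] = 1.0                 # impulse at the origin

delta_hat = np.fft.fft(delta)

# Every DFT coefficient equals 1 (up to floating-point error).
assert np.allclose(delta_hat, np.ones(n))
```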


Temporal or spatial domain | Fourier domain
$\hat{f}(r) = \mathcal{F}\{f\}(r)$ | $(2\pi)^d f(-\omega)$
$f^\vee(r) = f(-r)$ | $\hat{f}(-\omega) = \hat{f}^\vee(\omega)$
$\overline{f(r)}$ | $\overline{\hat{f}(-\omega)}$
$f(A^{\mathrm{T}} r)$ | $\frac{1}{|\det A|} \hat{f}(A^{-1}\omega)$
$f(r - h)$ | $e^{-j\langle h, \omega \rangle} \hat{f}(\omega)$
$e^{j\langle r, \omega_0 \rangle} f(r)$ | $\hat{f}(\omega - \omega_0)$
$\partial^n f(r)$ | $(j\omega)^n \hat{f}(\omega)$
$r^n f(r)$ | $j^{|n|} \partial^n \hat{f}(\omega)$
$(g * f)(r)$ | $\hat{g}(\omega) \hat{f}(\omega)$
$g(r) f(r)$ | $(2\pi)^{-d} (\hat{g} * \hat{f})(\omega)$

Table 3.3: Basic properties of the (generalized) Fourier transform.

The fundamental property of the generalized Fourier transform is that it maps $\mathcal{S}'(\mathbb{R}^d)$ into itself and that it is invertible with $\mathcal{F}^{-1} = \frac{1}{(2\pi)^d} \bar{\mathcal{F}}$, where $\bar{\mathcal{F}}\{f\} = \mathcal{F}\{f^\vee\}$. This quasi self-reversibility (also expressed by the first row of Table 3.3) implies that any operation on generalized functions that is admissible in the space/time domain has its counterpart in the Fourier domain, and vice versa. For instance, multiplication with a smooth function in the Fourier domain corresponds to a convolution in the signal domain. Consequently, the familiar functional identities concerning the classical Fourier transform, such as the formulas for change of variables and differentiation, among others, also hold true for this generalization. These are summarized in Table 3.3.

In addition, the reader can find in Appendix A a table of Fourier transforms of some important singular generalized functions in one and several variables.

3.3.4 The kernel theorem

The kernel theorem provides a characterization of continuous operators $\mathcal{X} \to \mathcal{X}'$ (with respect to the nuclear topology on $\mathcal{X}$ and the weak-* topology on $\mathcal{X}'$). We shall state a version of the theorem for $\mathcal{X} = \mathcal{S}(\mathbb{R}^d)$, which is the one we shall use. The version for $\mathcal{D}(\mathbb{R}^d)$ is obtained by replacing the symbol $\mathcal{S}$ with $\mathcal{D}$ everywhere in the statement of the theorem.

Theorem 1 (Schwartz's kernel theorem: first form). Every continuous linear operator $\mathcal{S}(\mathbb{R}^d) \to \mathcal{S}'(\mathbb{R}^d)$ can be written in the form

$\varphi(r) \mapsto \int_{\mathbb{R}^d} F(r, s) \varphi(s) \, \mathrm{d}s, \quad (3.5)$

where $F(\cdot, \cdot)$ is a generalized function in $\mathcal{S}'(\mathbb{R}^d \times \mathbb{R}^d)$.

We can interpret the above formula as some sort of continuous-domain matrix-vector product, where $r, s$ play the role of vector indices. This characterization of continuous linear operators as infinite-dimensional matrix-vector products partly justifies our earlier statement that nuclear spaces “resemble” finite-dimensional spaces in some important ways.
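The matrix-vector analogy can be made concrete by discretization (an illustrative sketch with an arbitrarily chosen smooth kernel, not a construction from the text): sampling on a uniform grid turns the kernel $F(r, s)$ into a matrix, and the integral in (3.5) into a matrix-vector product weighted by the grid spacing.

```python
import numpy as np

# Discretized version of phi(r) -> integral F(r, s) phi(s) ds:
# the kernel becomes a matrix, the integral a Riemann sum.
n, h = 200, 0.05                 # grid size and spacing (illustrative)
s = np.arange(n) * h             # sample points
F = np.exp(-(s[:, None] - s[None, :])**2)   # a smooth kernel F(r, s), chosen for illustration
phi = np.sin(s)                  # a sampled test function

out = F @ phi * h                # (F phi)(r_i) ~ sum_j F(r_i, s_j) phi(s_j) h

assert out.shape == (n,)
assert np.all(np.isfinite(out))
```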


An equivalent statement of the above theorem is as follows.

Theorem 2 (Schwartz's kernel theorem: second form). Every continuous bilinear form $l : \mathcal{S}(\mathbb{R}^d) \times \mathcal{S}(\mathbb{R}^d) \to \mathbb{R}$ (or $\mathbb{C}$) can be written as

$l(\varphi_1, \varphi_2) = \langle \varphi_1(\cdot)\varphi_2(\cdot), F(\cdot, \cdot) \rangle = \int_{\mathbb{R}^d \times \mathbb{R}^d} \varphi_1(r) \varphi_2(s) F(r, s) \, \mathrm{d}s \, \mathrm{d}r.$

The connection between the two statements is clarified by relating the continuous bilinear form $l$ to a continuous linear operator $A : \mathcal{S}(\mathbb{R}^d) \to \mathcal{S}'(\mathbb{R}^d)$ by an identity of the form

$l(\varphi_1, \varphi_2) = \langle \varphi_1, A\varphi_2 \rangle.$

3.3.5 Linear shift-invariant (LSI) operators and convolutions

Let $S_{r_0}$ denote the shift operator $\varphi \mapsto \varphi(\cdot - r_0)$. We call an operator $U$ shift-invariant if $U S_{r_0} = S_{r_0} U$ for all $r_0 \in \mathbb{R}^d$.

As a corollary of the kernel theorem, we have the following characterization of linear shift-invariant (LSI) operators $\mathcal{S} \to \mathcal{S}'$ (and a similar characterization for those $\mathcal{D} \to \mathcal{D}'$).

Corollary 1. Every continuous linear shift-invariant operator $\mathcal{S}(\mathbb{R}^d) \to \mathcal{S}'(\mathbb{R}^d)$ can be written as a convolution

$\varphi(r) \mapsto (\varphi * F)(r) = \int_{\mathbb{R}^d} F(r - s) \varphi(s) \, \mathrm{d}s$

with some generalized function $F \in \mathcal{S}'(\mathbb{R}^d)$.

Moreover, in this case we have the convolution-multiplication formula

$\mathcal{F}\{\varphi * F\} = \hat{\varphi} \, \hat{F}.$
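The discrete counterpart of this convolution-multiplication formula is easy to verify numerically (a finite, circular-convolution sketch with the DFT playing the role of $\mathcal{F}$; the signals are arbitrary random sequences):

```python
import numpy as np

# Discrete analogue of F{phi * F} = phi_hat * F_hat: the circular
# convolution of two sequences equals the inverse DFT of the
# pointwise product of their DFTs.
rng = np.random.default_rng(0)
n = 128
phi = rng.standard_normal(n)
F = rng.standard_normal(n)

# Circular convolution computed directly from its definition...
conv_direct = np.array([sum(phi[j] * F[(k - j) % n] for j in range(n))
                        for k in range(n)])
# ...and via the DFT-domain product.
conv_fft = np.real(np.fft.ifft(np.fft.fft(phi) * np.fft.fft(F)))

assert np.allclose(conv_direct, conv_fft)
```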

Note that the convolution of a test function and a distribution is in general a distribution. The latter is smooth (and therefore equivalent to an ordinary function), but not necessarily rapidly decaying. However, $\varphi * F$ will once again belong continuously to $\mathcal{S}$ if $\hat{F}$, the Fourier transform of $F$, is a smooth (infinitely differentiable) function with at most polynomial growth at infinity, because the smoothness of $\hat{F}$ translates into $F$ having rapid decay in the spatio-temporal domain, and vice versa. In particular, we note that the condition is met when $F \in \mathcal{R}(\mathbb{R}^d)$ (since $r^n F(r) \in L_1(\mathbb{R}^d)$ for any $n \in \mathbb{N}^d$). A classical situation in dimension $d = 1$ where the decay is guaranteed to be exponential is when the Fourier transform of $F$ is a rational transfer function of the form

$\hat{F}(\omega) = \frac{\prod_{m=1}^{M} (j\omega - z_m)}{\prod_{n=1}^{N} (j\omega - p_n)}$

with no purely imaginary pole (i.e., with $\operatorname{Re}(p_n) \neq 0$, $1 \leq n \leq N$). 4
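The same pole-decay link can be illustrated with a discrete-time analogue (a sketch, not the continuous-domain statement itself): a single stable pole $p$ in $H(z) = 1/(1 - p z^{-1})$, $|p| < 1$, yields the geometrically (i.e., exponentially) decaying impulse response $h[k] = p^k$.

```python
import numpy as np

# Discrete-time analogue of a rational transfer function with a
# stable pole: feed a unit impulse through y[k] = p*y[k-1] + x[k].
p = 0.8
n = 50
x = np.zeros(n)
x[0] = 1.0                      # unit impulse input

h = np.zeros(n)
for k in range(n):
    h[k] = (p * h[k - 1] if k > 0 else 0.0) + x[k]

# The impulse response decays exponentially: h[k] = p**k.
assert np.allclose(h, p ** np.arange(n))
```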

Since any sequence that converges in some $L_p$ space, with $1 \leq p \leq \infty$, also converges in $\mathcal{S}'$, the kernel theorem implies that any continuous linear operator $\mathcal{S}(\mathbb{R}^d) \to L_p(\mathbb{R}^d)$ can be written in the form specified by (3.5).

In defining the convolution of two distributions, some caution should be exercised. To be consistent with the previous definitions, we can view convolutions as continuous linear shift-invariant operators. The convolution of two distributions will then correspond to

4. For M or N = 0, we shall take the corresponding product to be equal to 1.


the composition of two LSI operators. To fix ideas, let us take two distributions $f$ and $g$, with corresponding operators $A_f$ and $A_g$. We then wish to identify $f * g$ with the composition $A_f A_g$. However, note that, by the kernel theorem, $A_f$ and $A_g$ are initially defined $\mathcal{S} \to \mathcal{S}'$. Since the codomain of $A_g$ (the space $\mathcal{S}'$) does not match the domain of $A_f$ (the space $\mathcal{S}$), this composition is a priori undefined.

There are two principal situations where we can get around the above limitation. The first is where the range of $A_g$ is limited to $\mathcal{S} \subset \mathcal{S}'$ (i.e., $A_g$ maps $\mathcal{S}$ to itself instead of the much larger $\mathcal{S}'$). This is the case for the distributions with a smooth Fourier transform that we discussed previously.

The second situation where we may define the convolution of $f$ and $g$ is when the range of $A_g$ can be restricted to some space $\mathcal{X}$ (i.e., $A_g : \mathcal{S} \to \mathcal{X}$), and, furthermore, $A_f$ has a continuous extension to $\mathcal{X}$; that is, we can extend it as $A_f : \mathcal{X} \to \mathcal{S}'$.

An important example of the second situation is when the distributions in question belong to the spaces $L_p(\mathbb{R}^d)$ and $L_q(\mathbb{R}^d)$ with $1 \leq p, q \leq \infty$ and $1/p + 1/q \geq 1$. In this case, their convolution is well-defined and can be identified with a function in $L_r(\mathbb{R}^d)$, $1 \leq r \leq \infty$, with

$\frac{1}{r} = \frac{1}{p} + \frac{1}{q} - 1.$

Moreover, for $f \in L_p(\mathbb{R}^d)$ and $g \in L_q(\mathbb{R}^d)$, we have

$\| f * g \|_r \leq \| f \|_p \, \| g \|_q.$

This result is Young's inequality for convolutions. An important special case of this identity, most useful in derivations, is obtained for $q = 1$ and $p = r$:

$\| f * g \|_p \leq \| f \|_p \, \| g \|_1.$

The latter formula indicates that $L_p(\mathbb{R}^d)$ spaces are “stable” under convolution with elements of $L_1(\mathbb{R}^d)$ (stable filters).
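The special case above can be checked numerically on sampled signals (a discretized sketch with $p = r = 2$, $q = 1$, and two arbitrarily chosen test functions; the sums approximate the corresponding integrals):

```python
import numpy as np

# Numerical check of ||f * g||_p <= ||f||_p ||g||_1 with p = 2,
# using Riemann-sum approximations of the norms.
dt = 0.01
t = np.arange(-5, 5, dt)
f = np.exp(-t**2)                 # a Gaussian bump
g = 1.0 / (1.0 + t**2)            # a Lorentzian-type function

conv = np.convolve(f, g) * dt     # discretized (f * g)
lhs = np.sqrt(np.sum(conv**2) * dt)                        # ||f * g||_2
rhs = np.sqrt(np.sum(f**2) * dt) * np.sum(np.abs(g)) * dt  # ||f||_2 ||g||_1

assert lhs <= rhs
```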

3.4 Probability theory

3.4.1 Probability measures

In this section we give an informal introduction to probability theory. A more precise presentation can be found in Appendix D.

Probability measures are mathematical constructs that permit us to assign numbers (probabilities) between 0 (almost impossible) and 1 (almost sure) to events. An event is modeled by a subset $A$ of the universal set $\Omega_X$ of all outcomes of a certain experiment $X$, which is assumed to be known. The symbol $\mathcal{P}_X(A)$ then gives the probability that some element of $A$ occurs as the outcome of experiment $X$. Note that, in general, we may assign probabilities only to some subsets of $\Omega_X$. We shall denote the collection of all subsets of $\Omega_X$ for which $\mathcal{P}_X$ is defined as $\mathcal{S}_X$.

The probability measure $\mathcal{P}_X$ then corresponds to a function $\mathcal{S}_X \to [0, 1]$. The triple $(\Omega_X, \mathcal{S}_X, \mathcal{P}_X)$ is called a probability space.

Frequently, the collection $\mathcal{S}_X$ contains open and closed sets, as well as their countable unions and intersections, collectively known as Borel sets. In this case we call $\mathcal{P}_X$ a Borel probability measure.

An important application of the notion of probability is in computing the “average” value of some (real- or complex-valued) quantity $f$ that depends on the outcome in $\Omega_X$. This quantity, the computation of which we shall discuss shortly, is called the expected value of $f$, and is denoted as $\mathbb{E}\{f(X)\}$.

An important context for probabilistic computations is when the outcome of $X$ can be encoded as a finite-dimensional numerical sequence, which implies that we can identify $\Omega_X$ with $\mathbb{R}^n$ (or a subset thereof). In this case, within the proper mathematical setting, we can find a (generalized) function $p_X$, called the probability distribution 5 or density function (pdf) of $X$, such that

$\mathcal{P}_X(A) = \int_A p_X(x) \, \mathrm{d}x$

for suitable subsets $A$ of $\mathbb{R}^n$. 6

More generally, the expected value of $f : \Omega_X \to \mathbb{C}$ is here given by

$\mathbb{E}\{f(X)\} = \int_{\mathbb{R}^n} f(x) \, p_X(x) \, \mathrm{d}x. \quad (3.6)$

We say “more generally” because $\mathcal{P}_X(A)$ can be seen as the expected value of the indicator function $\mathbb{1}_A(X)$. Since the integral of a complex-valued $f$ can be written as the sum of its real and imaginary parts, without loss of generality we shall consider only real-valued functions where convenient.

When the outcome of the experiment is a vector with infinitely many coordinates (for instance a function $\mathbb{R} \to \mathbb{R}$), it is typically not possible to characterize probabilities with probability distributions. It is nevertheless still possible to define probability measures on subsets of $\Omega_X$, and also to define the integral (average value) of many a function $f : \Omega_X \to \mathbb{R}$. In effect, a definition of the integral of $f$ with respect to the probability measure $\mathcal{P}_X$ is obtained using a limit of “simple” functions (finite weighted sums of indicator functions) that approximate $f$. For this general definition of the integral we use the notation

$\mathbb{E}\{f(X)\} = \int_{\Omega_X} f(x) \, \mathcal{P}_X(\mathrm{d}x),$

which we may also use, in addition to (3.6), in the case of a finite-dimensional $\Omega_X$.

In general, given a function $f : \Omega_X \to \Omega_Y$ that defines a new outcome $y \in \Omega_Y$ for every outcome $x \in \Omega_X$ of experiment $X$, one can see the result of applying $f$ to the outcome of $X$ as a new experiment $Y$. The probability of an event $B \subset \Omega_Y$ is the same as the combined probability of all outcomes of $X$ that generate an outcome in $B$. Thus, mathematically,

$\mathcal{P}_Y(B) = \mathcal{P}_X(f^{-1}(B)) = \mathcal{P}_X \circ f^{-1}(B),$

where the inverse image $f^{-1}(B)$ is defined as

$f^{-1}(B) = \{x \in \Omega_X : f(x) \in B\}.$

$\mathcal{P}_Y = \mathcal{P}_X(f^{-1} \cdot)$ is called the push-forward of $\mathcal{P}_X$ through $f$.
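A push-forward is easy to probe by Monte Carlo simulation (an illustrative sketch with a concrete, arbitrarily chosen map $f(x) = x^2$ and $X$ uniform on $[0, 1]$): then $\mathcal{P}_Y([0, b]) = \mathcal{P}_X(f^{-1}([0, b])) = \mathcal{P}_X([0, \sqrt{b}]) = \sqrt{b}$.

```python
import numpy as np

# Push-forward of the uniform measure on [0, 1] through f(x) = x**2:
# the exact value of P_Y([0, b]) is sqrt(b).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=200_000)   # samples of X
y = x**2                                   # outcomes of the new experiment Y

b = 0.25
empirical = np.mean(y <= b)                # Monte Carlo estimate of P_Y([0, b])
exact = np.sqrt(b)                         # = 0.5

assert abs(empirical - exact) < 0.01
```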

5. Probability distributions should not be confused with the distributions in the sense of Schwartz (i.e., generalized functions) that were introduced in Section 3.3. It is important to distinguish the two usages, in part because, as we describe here, in finite dimensions a connection can be made between probability distributions and positive generalized functions.

6. In classical probability theory, pdfs are defined as the Radon-Nikodym derivative of a probability measure with respect to some other measure, typically the Lebesgue measure (as we shall assume). This requires the probability measure to be absolutely continuous with respect to the latter measure. The definition of the generalized pdf given here is more permissive, and also includes measures that are singular with respect to the Lebesgue measure (for instance the Dirac measure of a point, for which the generalized pdf is a Dirac distribution). This generalization relies on identifying measures on the Euclidean space with positive linear functionals.


3.4.2 Joint probabilities and independence

When two experiments $X$ and $Y$ with probabilities $\mathcal{P}_X$ and $\mathcal{P}_Y$ are considered simultaneously, one can imagine a joint probability space $(\Omega_{X,Y}, \mathcal{S}_{X,Y}, \mathcal{P}_{X,Y})$ that supports both $X$ and $Y$, in the sense that there exist functions $f : \Omega_{X,Y} \to \Omega_X$ and $g : \Omega_{X,Y} \to \Omega_Y$ such that

$\mathcal{P}_X(A) = \mathcal{P}_{X,Y}(f^{-1}(A)) \quad \text{and} \quad \mathcal{P}_Y(B) = \mathcal{P}_{X,Y}(g^{-1}(B))$

for all $A \in \mathcal{S}_X$ and $B \in \mathcal{S}_Y$.

The functions $f, g$ above are assumed to be fixed, and the joint event that $A$ occurs for $X$ and $B$ for $Y$ is given by

$f^{-1}(A) \cap g^{-1}(B).$

If the outcome of $X$ has no bearing on the outcome of $Y$ and vice versa, then $X$ and $Y$ are said to be independent. In terms of probabilities, this translates into the probability factorization rule

$\mathcal{P}_{X,Y}(f^{-1}(A) \cap g^{-1}(B)) = \mathcal{P}_X(A) \cdot \mathcal{P}_Y(B) = \mathcal{P}_{X,Y}(f^{-1}(A)) \cdot \mathcal{P}_{X,Y}(g^{-1}(B)).$

The above ideas can be extended to any finite collection of experiments $X_1, \dots, X_M$ (and even to infinite ones, with appropriate precautions and adaptations).

3.4.3 Characteristic functions in finite dimensions

In finite dimensions, given a probability measure $\mathcal{P}_X$ on $\Omega_X = \mathbb{R}^n$, for any vector $\xi \in \mathbb{R}^n$ we can compute the expected value (integral) of the bounded function $x \mapsto e^{j\langle \xi, x \rangle}$. This permits us to define a complex-valued function on $\mathbb{R}^n$ by the formula

$\hat{p}_X(\xi) = \mathbb{E}\{e^{j\langle \xi, X \rangle}\} = \int_{\mathbb{R}^n} e^{j\langle \xi, x \rangle} p_X(x) \, \mathrm{d}x = \mathcal{F}\{p_X\}(\xi), \quad (3.7)$

which corresponds to a slightly different definition of the Fourier transform of the (generalized) probability distribution $p_X$. The convention in probability theory is to define the forward Fourier transform with a positive sign for $j\langle \xi, x \rangle$, which is the opposite of the convention used in analysis. To avoid confusion, we use the variable $\xi$ as the Fourier variable in probabilistic Fourier transforms [characteristic functions], and $\omega$ in the analytic definition.

One can prove that $\hat{p}_X$, as defined above, is always continuous at 0 with $\hat{p}_X(0) = 1$, and that it is positive-definite (see Definition 26 in Appendix B).

Remarkably, the converse of the above fact is also true. We record the latter result, which is due to Bochner, together with the former observation, as Theorem 3.

Theorem 3 (Bochner). Let $\hat{p}_X : \mathbb{R}^n \to \mathbb{C}$ be a function that is positive-definite, fulfills $\hat{p}_X(0) = 1$, and is continuous at 0. Then, there exists a unique Borel probability measure $\mathcal{P}_X$ on $\mathbb{R}^n$, such that

$\hat{p}_X(\xi) = \int_{\mathbb{R}^n} e^{j\langle \xi, x \rangle} \, \mathcal{P}_X(\mathrm{d}x) = \mathbb{E}\{e^{j\langle \xi, X \rangle}\}.$

Conversely, the function specified by (3.7) with $p_X(x) \geq 0$ and $\int_{\mathbb{R}^n} p_X(x) \, \mathrm{d}x = 1$ is positive-definite, uniformly continuous, and such that $|\hat{p}_X(\xi)| \leq \hat{p}_X(0) = 1$.

The interesting twist (which is due to Lévy) is that the positive-definiteness of $\hat{p}_X$ and its continuity at 0 imply continuity everywhere (as well as boundedness).
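The correspondence between a measure and its characteristic function can be visualized numerically (a Monte Carlo sketch for the standard Gaussian, whose characteristic function is the well-known $e^{-\xi^2/2}$; note the $+j$ probabilists' sign convention):

```python
import numpy as np

# Empirical characteristic function of standard Gaussian samples
# versus the analytic expression exp(-xi**2 / 2).
rng = np.random.default_rng(2)
x = rng.standard_normal(500_000)

xi = np.linspace(-3, 3, 13)
empirical = np.array([np.mean(np.exp(1j * t * x)) for t in xi])
analytic = np.exp(-xi**2 / 2)

max_err = np.max(np.abs(empirical - analytic))
assert max_err < 0.01
```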


Since, by the above theorem, $\hat{p}_X$ uniquely identifies $\mathcal{P}_X$, it is called the characteristic function of the probability measure $\mathcal{P}_X$ (recall that the probability measure $\mathcal{P}_X$ is related to the density $p_X$ by $\mathcal{P}_X(E) = \int_E p_X(x) \, \mathrm{d}x$ for sets $E$ in the $\sigma$-algebra over $\mathbb{R}^n$).

The next theorem characterizes weak convergence of measures on $\mathbb{R}^n$ in terms of their characteristic functions.

Theorem 4 (Lévy's continuity theorem). Let $(\mathcal{P}_{X_i})$ be a sequence of probability measures on $\mathbb{R}^n$ with respective sequence of characteristic functions $(\hat{p}_{X_i})$. If there exists a function $\hat{p}_X$ such that

$\lim_i \hat{p}_{X_i}(\xi) = \hat{p}_X(\xi)$

pointwise on $\mathbb{R}^n$, and if, in addition, $\hat{p}_X$ is continuous at 0, then $\hat{p}_X$ is the characteristic function of a probability measure $\mathcal{P}_X$ on $\mathbb{R}^n$. Moreover, $\mathcal{P}_{X_i}$ converges weakly to $\mathcal{P}_X$, in symbols

$\mathcal{P}_{X_i} \xrightarrow{w} \mathcal{P}_X,$

meaning that for any bounded continuous function $f : \mathbb{R}^n \to \mathbb{R}$,

$\lim_i \mathbb{E}_{X_i}\{f\} = \mathbb{E}_X\{f\}.$

The converse of the above theorem is also true; namely, if $\mathcal{P}_{X_i} \xrightarrow{w} \mathcal{P}_X$, then $\hat{p}_{X_i}(\xi) \to \hat{p}_X(\xi)$ pointwise.

3.4.4 Characteristic functionals in infinite dimensions

Given a probability measure $\mathcal{P}_X$ on the continuous dual $\mathcal{X}'$ of some test-function space $\mathcal{X}$, one can define an analogue of the finite-dimensional characteristic function, dubbed the characteristic functional of $\mathcal{P}_X$ and denoted as $\widehat{\mathcal{P}}_X$, by means of the identity

$\widehat{\mathcal{P}}_X(\varphi) = \mathbb{E}\{e^{j\langle \varphi, X \rangle}\}. \quad (3.8)$

Comparing the above definition with (3.7), one notes that $\mathbb{R}^n$, as the domain of the characteristic function $\hat{p}_X$, is now replaced by the space $\mathcal{X}$ of test functions.

As was the case in finite dimensions, the characteristic functional fulfills two important conditions:

Positive-definiteness: $\widehat{\mathcal{P}}_X$ is positive-definite, in the sense that for any $N$ (test) functions $\varphi_1, \dots, \varphi_N$, $N \in \mathbb{N}$, the $N \times N$ matrix with entries $p_{ij} = \widehat{\mathcal{P}}_X(\varphi_i - \varphi_j)$ is non-negative definite.

Normalization: $\widehat{\mathcal{P}}_X(0) = 1$.

In view of the finite-dimensional result (Bochner's theorem), it is natural to ask if a condition in terms of continuity can be given also in the infinite-dimensional case, so that any functional $\widehat{\mathcal{P}}_X$ fulfilling this continuity condition, in addition to the above two, uniquely identifies a probability measure on $\mathcal{X}'$. In the case where $\mathcal{X}$ is a nuclear space (and, in particular, for $\mathcal{X} = \mathcal{S}(\mathbb{R}^d)$ or $\mathcal{D}(\mathbb{R}^d)$, cf. Subsection 3.1.3), such a condition is given by the Minlos-Bochner theorem.

Theorem 5 (Minlos-Bochner). Let $\mathcal{X}$ be a nuclear space and let $\widehat{\mathcal{P}}_X : \mathcal{X} \to \mathbb{C}$ be a functional that is positive-definite in the sense discussed above, fulfills $\widehat{\mathcal{P}}_X(0) = 1$, and is continuous $\mathcal{X} \to \mathbb{C}$. Then, there exists a unique probability measure $\mathcal{P}_X$ on $\mathcal{X}'$ (the continuous dual of $\mathcal{X}$), such that

$\widehat{\mathcal{P}}_X(\varphi) = \int_{\mathcal{X}'} e^{j\langle \varphi, x \rangle} \, \mathcal{P}_X(\mathrm{d}x) = \mathbb{E}\{e^{j\langle \varphi, X \rangle}\}.$


Conversely, the characteristic functional associated with some probability measure $\mathcal{P}_X$ on $\mathcal{X}'$ is positive-definite, continuous over $\mathcal{X}$, and such that $\widehat{\mathcal{P}}_X(0) = 1$.

The practical implication of this result is that one can rely on characteristic functionals to indirectly specify infinite-dimensional measures (most importantly, probabilities of stochastic processes), which are difficult to pin down otherwise. Operationally, the characteristic functional $\widehat{\mathcal{P}}_X(\varphi)$ is nothing but a mathematical rule (e.g., $\widehat{\mathcal{P}}_X(\varphi) = e^{-\frac{1}{2}\|\varphi\|_2^2}$) that returns a value in $\mathbb{C}$ for any given function $\varphi \in \mathcal{S}$. The truly powerful aspect is that this rule condenses all the information about the statistical distribution of some underlying infinite-dimensional random object $X$. When working with characteristic functionals, we shall see that computing probabilities and deriving various properties of the said processes are all reduced to analytical derivations.

3.5 Generalized random processes and fields

In this section, we present an introduction to the theory of generalized random processes, which is concerned with defining probabilities on function spaces, that is, infinite-dimensional vector spaces with some notion of limit and convergence. We have made the point before that the theory of generalized functions is a natural extension of finite-dimensional linear algebra. The same kind of parallel can be drawn between the theory of generalized stochastic processes and conventional probability calculus (which deals with finite-dimensional random vector variables). Therefore, before getting into more detailed explanations, it is instructive to look back at Table 3.2, which provides a side-by-side summary of the primary probabilistic concepts that have been introduced so far. The reader is then referred to Table 3.4, which presents a comparison of finite- and infinite-dimensional “innovation models”. To give the basic idea, in finite dimensions, an “innovation” is a vector in $\mathbb{R}^n$ of independent identically distributed (i.i.d.) random variables. An “innovation model” is obtained by transforming such a vector by means of a linear operator (a matrix), which embodies the structure of dependencies of the model. In infinite dimensions, the notion of an i.i.d. vector is replaced by that of a random process with independent values at every point (which we shall call an “innovation process”). The transformation is achieved by applying a continuous linear operator, which constitutes the generalization of a matrix. The characterization of such models is made possible by their characteristic functionals, which, as we saw in the previous section, are the infinite-dimensional equivalents of characteristic functions of random variables.

3.5.1 Generalized random processes as collections of random variables

A generalized stochastic process 7 is essentially a randomization of the idea of a generalized function (Section 3.3), in much the same way as an ordinary stochastic process is a randomization of the concept of a function.

At a minimum, the definition of a generalized stochastic process $s$ should permit us to associate probabilistic models with observations made using test functions. In other words, to any test function $\varphi$ in some suitable test-function space $\mathcal{X}$ is associated a random variable $s(\varphi)$, also often denoted as $\langle \varphi, s \rangle$. This is to be contrasted with an observation $s(t)$ at time $t$, which would be modeled by a random variable in the case of an ordinary stochastic process. We shall denote the probability measure of the random variable $\langle \varphi, s \rangle$

7. We shall use the terms random/stochastic process and field almost interchangeably. The distinction, in general, lies in the fact that for a random process, the parameter is typically interpreted as time, while for a field, the parameter is typically multi-dimensional and interpreted as spatial or spatio-temporal location.


finite-dimensional | infinite-dimensional
standard Gaussian i.i.d. vector $W = (W_1, \dots, W_N)$ | standard Gaussian white noise $w$
$\hat{p}_W(\xi) = e^{-\frac{1}{2}|\xi|^2}$, $\xi \in \mathbb{R}^N$ | $\widehat{\mathcal{P}}_w(\varphi) = e^{-\frac{1}{2}\|\varphi\|_2^2}$, $\varphi \in \mathcal{S}$
multivariate Gaussian vector $X$ | Gaussian generalized process $s$
$X = AW$ | $s = \mathrm{A}w$ (for continuous $\mathrm{A} : \mathcal{S}' \to \mathcal{S}'$)
$\hat{p}_X(\xi) = e^{-\frac{1}{2}|A^{\mathrm{T}}\xi|^2}$ | $\widehat{\mathcal{P}}_s(\varphi) = e^{-\frac{1}{2}\|\mathrm{A}^*\varphi\|_2^2}$
general i.i.d. vector $W = (W_1, \dots, W_N)$ with exponent $f$ | general white noise $w$ with Lévy exponent $f$
$\hat{p}_W(\xi) = e^{\sum_{i=1}^{N} f(\xi_i)}$ | $\widehat{\mathcal{P}}_w(\varphi) = e^{\int_{\mathbb{R}^d} f(\varphi(r)) \, \mathrm{d}r}$
linear transformation of general i.i.d. random vector $W$ (innovation model) | linear transformation of general white noise $w$ (innovation model)
$X = AW$ | $s = \mathrm{A}w$
$\hat{p}_X(\xi) = \hat{p}_W(A^{\mathrm{T}}\xi)$ | $\widehat{\mathcal{P}}_s(\varphi) = \widehat{\mathcal{P}}_w(\mathrm{A}^*\varphi)$

Table 3.4: Comparison of innovation models in finite- and infinite-dimensional settings. See Sections 4.3-4.5 for a detailed explanation.
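The left column of Table 3.4 can be simulated directly (a finite-dimensional sketch with an arbitrarily chosen mixing matrix $A$): for Gaussian i.i.d. $W$, the characteristic function $e^{-\frac{1}{2}|A^{\mathrm{T}}\xi|^2}$ says that $X = AW$ is zero-mean Gaussian with covariance $AA^{\mathrm{T}}$, which a Monte Carlo estimate reproduces.

```python
import numpy as np

# Finite-dimensional innovation model X = A W: an i.i.d. Gaussian
# innovation W shaped by a matrix A encoding the dependencies.
rng = np.random.default_rng(4)
A = np.array([[1.0, 0.0, 0.0],
              [0.8, 0.6, 0.0]])            # example dependency structure (2 x 3)

W = rng.standard_normal((3, 200_000))      # i.i.d. innovation samples
X = A @ W

emp_cov = X @ X.T / W.shape[1]             # Monte Carlo covariance estimate
assert np.allclose(emp_cov, A @ A.T, atol=0.02)
```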

as $\mathcal{P}_{s,\varphi}$. Similarly, to any finite collection of observations $\langle \varphi_i, s \rangle$, $1 \leq i \leq N$, $N \in \mathbb{N}$, corresponds a joint probability measure $\mathcal{P}_{s,\varphi_1:\varphi_N}$ on $\mathbb{R}^N$ (we shall only consider real-valued processes here, and therefore assume the observations to be real-valued).

Moreover, finite families of observations $\langle \varphi_i, s \rangle$, $1 \leq i \leq N$, and $\langle \psi_j, s \rangle$, $1 \leq j \leq M$, need to be consistent or compatible, as explained in Appendix D.4, to ensure that all computations of the probability of an event involving finite observations yield the same value for the probability. In modeling physical phenomena, it is also reasonable to assume some weak form of continuity in the probability of $\langle \varphi, s \rangle$ as a function of $\varphi$.

Mathematically, these requirements are fulfilled by the kind of probabilistic model induced by a cylinder-set probability measure, as discussed in Appendix D.4. In other words, a cylinder-set probability measure provides a consistent probabilistic description for all finite sets of observations of some phenomenon $s$ using test functions $\varphi \in \mathcal{X}$. Furthermore, a cylinder-set probability measure can always be specified via its characteristic functional $\widehat{\mathcal{P}}_s(\varphi) = \mathbb{E}\{e^{j\langle \varphi, s \rangle}\}$, which makes it amenable to analytic computations.

The only conceptual limitation of such a probability model is that, at least a priori, it does not permit us to associate the sample paths of the process with (generalized) functions. Put differently, in this framework, we are not allowed to interpret $s$ as a random entity belonging to the dual $\mathcal{X}'$ of $\mathcal{X}$, since we have not yet defined a proper probability measure on $\mathcal{X}'$. 8 Doing so involves some additional steps.

8. In fact, $\mathcal{X}'$ may very well be too small to support such a description (while the algebraic dual, $\mathcal{X}^*$, can support the measure, by Kolmogorov's extension theorem, but is too large for many practical purposes). An important example is that of white Gaussian noise, which one may conceive of as associating a Gaussian random variable with variance $\|\varphi\|_2^2$ to any test function $\varphi \in L_2$. However, the “energy” of white Gaussian noise is clearly infinite. Therefore it cannot be modeled as a randomly chosen function in $(L_2)' = L_2$.


3.5.2 Generalized random processes as random generalized functions

Fortunately, the above existence and interpretation problem is fully resolved by taking $\mathcal{X}$ to be a nuclear space, thanks to the Minlos-Bochner theorem (Theorem 5). This allows for the extension of the underlying cylinder-set probability measure to a proper (by which here we mean countably additive) probability measure on $\mathcal{X}'$ (the topological dual of $\mathcal{X}$), as sketched out in Appendix D.4.

In this case, the joint probabilities $\mathcal{P}_{s,\varphi_1:\varphi_N}$, $\varphi_1, \dots, \varphi_N \in \mathcal{X}$, $N \in \mathbb{N}$, corresponding to the random variables $\langle \varphi_i, s \rangle$ for all possible choices of test functions, collectively define a probability measure $\mathcal{P}_s$ on the infinite-dimensional dual space $\mathcal{X}'$. This means that we can view $s$ as an element drawn randomly from $\mathcal{X}'$ according to the probability law $\mathcal{P}_s$.

In particular, if we take $\mathcal{X}$ to be either $\mathcal{S}(\mathbb{R}^d)$ or $\mathcal{D}(\mathbb{R}^d)$, then our generalized random process/field will have realizations that are distributions in $\mathcal{S}'(\mathbb{R}^d)$ or $\mathcal{D}'(\mathbb{R}^d)$, respectively. We can then also think of $\langle \varphi, s \rangle$ as the measurement of this random object $s$ by means of some sensor (test function) $\varphi$ in $\mathcal{S}$ or $\mathcal{D}$.

Since we shall rely on this fact throughout the book, we reiterate that a complete probabilistic characterization of $s$ as a probability measure on the space $\mathcal{X}'$ (dual to some nuclear space $\mathcal{X}$) is provided by its characteristic functional. The truly powerful aspect of the Minlos-Bochner theorem is that the implication goes both ways: any continuous positive-definite functional $\widehat{\mathcal{P}}_s : \mathcal{X} \to \mathbb{C}$ with proper normalization identifies a unique probability measure $\mathcal{P}_s$ on $\mathcal{X}'$. Therefore, to define a generalized random process $s$ with realizations in $\mathcal{X}'$, it suffices to produce a functional $\widehat{\mathcal{P}}_s : \mathcal{X} \to \mathbb{C}$ with the noted properties.

3.5.3 Determination of statistics from the characteristic functional

The characteristic functional of the generalized random process $s$ contains complete information about its probabilistic properties: it can be used to compute all probabilities, and to derive or verify any probabilistic property of $s$.

Most importantly, it can yield the $N$th-order joint probability density of any set of linear observations of $s$ by suitable $N$-dimensional inverse Fourier transformation. This follows from a straightforward manipulation in the domain of the characteristic function and is recorded for further reference.

Proposition 1. Let $Y = (\langle s, \varphi_1 \rangle, \dots, \langle s, \varphi_N \rangle)$ with $\varphi_1, \dots, \varphi_N \in \mathcal{X}$ be a set of linear measurements of the generalized stochastic process $s$ with characteristic functional $\widehat{\mathcal{P}}_s(\varphi) = \mathbb{E}\{e^{j\langle \varphi, s \rangle}\}$ that is continuous over the function space $\mathcal{X}$. Then,

$\hat{p}_Y(\xi) = \widehat{\mathcal{P}}_{s,\varphi_1:\varphi_N}(\xi) = \widehat{\mathcal{P}}_s\!\left( \sum_{n=1}^{N} \xi_n \varphi_n \right)$

and the joint pdf of Y is given by

$p_Y(y) = \mathcal{F}^{-1}\{\hat{p}_Y\}(y) = \int_{\mathbb{R}^N} \widehat{\mathcal{P}}_s\!\left( \sum_{n=1}^{N} \xi_n \varphi_n \right) e^{-j\langle y, \xi \rangle} \, \frac{\mathrm{d}\xi}{(2\pi)^N},$

where the observation functions $\varphi_n \in \mathcal{X}$ are fixed and $\xi = (\xi_1, \dots, \xi_N)$ plays the role of the $N$-dimensional Fourier variable.

Proof. The continuity assumption over the function space $\mathcal{X}$ (which need not be nuclear) ensures that the manipulation is legitimate. Starting from the definition of the characteristic function of $Y$, we have

$\hat{p}_Y(\xi) = \mathbb{E}\left\{ \exp\left( j \langle \xi, Y \rangle \right) \right\}$

$= \mathbb{E}\left\{ \exp\left( j \sum_{n=1}^{N} \xi_n \langle s, \varphi_n \rangle \right) \right\}$

$= \mathbb{E}\left\{ \exp\left( j \left\langle \sum_{n=1}^{N} \xi_n \varphi_n, s \right\rangle \right) \right\}$ (by linearity of the duality product)

$= \widehat{\mathcal{P}}_s\!\left( \sum_{n=1}^{N} \xi_n \varphi_n \right)$ (by definition of $\widehat{\mathcal{P}}_s(\varphi)$)

The density $p_Y$ is then obtained by inverse (conjugate) Fourier transformation.
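Proposition 1 can be exercised numerically for a concrete case (an illustrative sketch, assuming the Gaussian white-noise functional $\widehat{\mathcal{P}}_s(\varphi) = e^{-\frac{1}{2}\|\varphi\|_2^2}$ and a single observation $Y = \langle s, \varphi \rangle$): then $\hat{p}_Y(\xi) = \widehat{\mathcal{P}}_s(\xi\varphi) = e^{-\frac{1}{2}\sigma^2\xi^2}$ with $\sigma^2 = \|\varphi\|_2^2$, and the inverse Fourier integral recovers the $\mathcal{N}(0, \sigma^2)$ density.

```python
import numpy as np

# Recover the pdf of Y = <s, phi> from its characteristic function
# by numerical inverse Fourier transformation (N = 1 case).
sigma2 = 0.5                                  # ||phi||_2^2 of some test function

xi = np.linspace(-40, 40, 4001)
dxi = xi[1] - xi[0]
p_hat_Y = np.exp(-0.5 * sigma2 * xi**2)       # P_hat_s(xi * phi)

y = np.linspace(-2, 2, 9)
# p_Y(y) = (1/2pi) integral p_hat_Y(xi) exp(-j y xi) dxi
p_Y = np.array([np.real(np.sum(p_hat_Y * np.exp(-1j * yy * xi)) * dxi)
                for yy in y]) / (2 * np.pi)

gauss = np.exp(-y**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
assert np.allclose(p_Y, gauss, atol=1e-6)
```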

Similarly, the formalism allows one to retrieve all first- and second-order moments of the generalized stochastic process $s$. To that end, one considers the mean and correlation functionals defined and computed as

Ms (') := E{h', si} = (°j)d

dªcPs,'(ª)

Ø

Ø

ª=0

= (°j)d

dªcPs (ª')

Ø

Ø

ª=0.

Bs ('1,'2) := E{h'1, sih'2, si} = (°j)2 @2

@ª1@ª2

cPs,'1,'2 (ª1,ª2)Ø

Ø

ª1,ª2=0

= (°j)2 @2

@ª1@ª2

cPs (ª1'1 +ª2'2)Ø

Ø

ª1,ª2=0.

When the space of test functions is nuclear (X =S (Rd ) or D(Rd )) and the above quantit-ies are well defined, we can find generalized functions ms (the generalized mean) and cs(the generalized autocorrelation function) such that

Ms (') =Z

Rd'(r )ms (r ) dr , (3.9)

Bs ('1,'2) =Z

Rd'1(r )'2(s)cs (r , s) dr . (3.10)

The first identity is simply a consequence of Ms being a continuous linear functional onX , while the second is an application of Schwartz’ kernel theorem (Theorem 2).
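The derivative formulas above can be verified numerically. The following sketch is an assumed discretization: it again uses the Gaussian white-noise functional $\widehat{\mathscr{P}}(\varphi) = \exp(-\frac{1}{2}\|\varphi\|_2^2)$, for which $B_s(\varphi_1,\varphi_2) = \langle\varphi_1,\varphi_2\rangle$, and approximates the mixed partial derivative at $\xi_1 = \xi_2 = 0$ by a four-point central finite difference.

```python
import numpy as np

# Discretized Gaussian white-noise characteristic functional (assumed model).
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
P_hat = lambda phi: np.exp(-0.5 * np.sum(phi**2) * dx)

phi1 = np.exp(-x**2)
phi2 = np.exp(-(x - 1.0)**2)

# B_s(phi1, phi2) = (-j)^2 d^2/dxi1 dxi2 P_hat(xi1*phi1 + xi2*phi2) at xi = 0,
# approximated by the standard 4-point central mixed difference.
h = 1e-3
mixed = (P_hat(h*phi1 + h*phi2) - P_hat(h*phi1 - h*phi2)
         - P_hat(-h*phi1 + h*phi2) + P_hat(-h*phi1 - h*phi2)) / (4 * h * h)
B_num = -mixed                      # factor (-j)^2 = -1

inner = np.sum(phi1 * phi2) * dx    # expected answer <phi1, phi2>
```

For this functional the finite-difference estimate reproduces the inner product $\langle\varphi_1,\varphi_2\rangle$ up to the $O(h^2)$ truncation error of the difference scheme.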

3.5.4 Operations on generalized stochastic processes

In constructing stochastic models, it is of interest to separate the essential randomness of the model (the "innovation") from its deterministic structure. Our way of approaching this objective is to encode the random part in a characteristic functional $\widehat{\mathscr{P}}_w$, and the deterministic structure of dependencies in an operator $\mathrm{U}$ (or, equivalently, in its adjoint $\mathrm{U}^*$). In the following paragraphs, we first review the mathematics of this construction before we come back to, and clarify, the said interpretation. The concepts presented here in abstract form are illustrated and made intuitive in the remainder of the book.

Given a continuous linear operator $\mathrm{U} : \mathcal{X} \to \mathcal{Y}$ with continuous adjoint $\mathrm{U}^* : \mathcal{Y}' \to \mathcal{X}'$, where $\mathcal{X}, \mathcal{Y}$ need not be nuclear, and a functional
\[ \widehat{\mathscr{P}}_w : \mathcal{Y} \to \mathbb{C} \]
that satisfies the three conditions of Theorem 5 (continuity, positive-definiteness, and normalization), we obtain a new functional
\[ \widehat{\mathscr{P}}_s : \mathcal{X} \to \mathbb{C} \]
fulfilling the same properties by composing $\widehat{\mathscr{P}}_w$ and $\mathrm{U}$ as per
\[ \widehat{\mathscr{P}}_s(\varphi) = \widehat{\mathscr{P}}_w(\mathrm{U}\varphi) \quad \text{for all } \varphi \in \mathcal{X}. \quad (3.11) \]

Writing
\[ \widehat{\mathscr{P}}_s(\xi\varphi) = \mathbb{E}\{e^{j\xi\langle\varphi,s\rangle}\} = \widehat{p}_{\langle\varphi,s\rangle}(\xi) \]
and
\[ \widehat{\mathscr{P}}_w(\xi\mathrm{U}\varphi) = \mathbb{E}\{e^{j\xi\langle \mathrm{U}\varphi,w\rangle}\} = \widehat{p}_{\langle \mathrm{U}\varphi,w\rangle}(\xi) \]
for the generalized processes $s$ and $w$, we deduce that the random variables $\langle\varphi,s\rangle$ and $\langle \mathrm{U}\varphi,w\rangle$ have the same characteristic function and therefore follow the same law:
\[ \langle\varphi,s\rangle = \langle \mathrm{U}\varphi, w\rangle \quad \text{in probability law.} \]
The manipulation that led to Proposition 1 shows that a similar relation holds, more generally, for any finite collection of observations $\langle\varphi_i,s\rangle$ and $\langle \mathrm{U}\varphi_i,w\rangle$, $1 \le i \le N$, $N \in \mathbb{N}$.

Therefore, symbolically at least, by the definition of the adjoint $\mathrm{U}^* : \mathcal{Y}' \to \mathcal{X}'$ of $\mathrm{U}$, we may write
\[ \langle\varphi, s\rangle = \langle\varphi, \mathrm{U}^* w\rangle. \]
This seems to indicate that, in a sense, the random model $s$, which we have defined using (3.11), can be interpreted as the application of $\mathrm{U}^*$ to the original random model $w$. However, things are complicated by the fact that, unless $\mathcal{X}$ and $\mathcal{Y}$ are nuclear spaces, we may not be able to interpret $w$ and $s$ as random elements of $\mathcal{Y}'$ and $\mathcal{X}'$, respectively. The application of $\mathrm{U}^* : \mathcal{Y}' \to \mathcal{X}'$ to $w$ should therefore be understood as a merely formal construction.

On the other hand, by requiring $\mathcal{X}$ to be nuclear and $\mathcal{Y}$ to be either nuclear or completely normed, we see immediately that $\widehat{\mathscr{P}}_s : \mathcal{X} \to \mathbb{C}$ fulfills the requirements of the Minlos-Bochner theorem and thereby defines a generalized random process with realizations in $\mathcal{X}'$.

The previous discussion suggests the following approach to defining generalized random processes: take a continuous positive-definite functional $\widehat{\mathscr{P}}_w : \mathcal{Y} \to \mathbb{C}$ on some (nuclear or completely normed) space $\mathcal{Y}$. Then, for any continuous operator $\mathrm{U}$ defined from a nuclear space $\mathcal{X}$ into $\mathcal{Y}$, the composition
\[ \widehat{\mathscr{P}}_s = \widehat{\mathscr{P}}_w(\mathrm{U}\,\cdot) \]
is the characteristic functional of a generalized random process $s$ with realizations in $\mathcal{X}'$.

In subsequent chapters, we shall mostly focus on the situation where $\mathrm{U} = \mathrm{L}^{-1*}$ and $\mathrm{U}^* = \mathrm{L}^{-1}$ for some given (whitening) operator $\mathrm{L}$ that admits a continuous inverse in the suitable topology, the typical choice of spaces being $\mathcal{X} = \mathcal{S}(\mathbb{R}^d)$ and $\mathcal{Y} = L_p(\mathbb{R}^d)$. The underlying hypothesis is that one is able to invert the linear operator and to recover $w$ from $s$, which is formally written as $w = \mathrm{L}s$; that is,
\[ \langle\varphi, w\rangle = \langle\varphi, \mathrm{L}s\rangle \quad \text{for all } \varphi \in \mathcal{Y}. \]
The above ideas are summarized in Figure 3.1.


[Figure 3.1 (diagram not reproduced): the innovation $w$, with characteristic functional $\widehat{\mathscr{P}}_w$, is mapped to $s = \mathrm{L}^{-1}w$, with $\widehat{\mathscr{P}}_s(\varphi) = \widehat{\mathscr{P}}_w(\mathrm{U}\varphi)$ and $\mathrm{U} = \mathrm{L}^{-1*}$.]

Figure 3.1: Definition of linear transformations of generalized stochastic processes using characteristic functionals. In this book, we shall focus on innovation models where $w$ is a white-noise process. The operator $\mathrm{L} = \mathrm{U}^{-1*}$ (if it exists) is called the whitening operator of $s$, since $\mathrm{L}s = w$.

3.5.5 Innovation processes

In a certain sense, the most fundamental class of generalized random processes that can play the role of $w$ in the construction of Section 3.5.4 are those with independent values at every point of $\mathbb{R}^d$ [GV64, Chap. 4, pp. 273-288]. The reason is that we can then isolate the spatiotemporal dependencies of the probabilistic model in the mixing operator ($\mathrm{U}^*$ in Figure 3.1) and attribute the randomness to independent contributions (innovations) at geometrically distinct points of the domain. We call such a construction an innovation model.

Let us attempt to make the notion of independence at every point more precise in the context of generalized stochastic processes, where the objects of study are, more accurately, not pointwise observations but rather observations made through scalar products with test functions. To qualify a generalized process $w$ as having independent values at every point, we therefore require that the random variables $\langle\varphi_1,w\rangle$ and $\langle\varphi_2,w\rangle$ be independent whenever the test functions $\varphi_1$ and $\varphi_2$ have disjoint supports.

Since the joint characteristic function of independent random variables factorizes (is separable), we can formulate the above property in terms of the characteristic functional $\widehat{\mathscr{P}}_w$ of $w$ as
\[ \widehat{\mathscr{P}}_w(\varphi_1 + \varphi_2) = \widehat{\mathscr{P}}_w(\varphi_1)\; \widehat{\mathscr{P}}_w(\varphi_2) \]
whenever $\varphi_1$ and $\varphi_2$ have disjoint supports. An important class of characteristic functionals fulfilling this requirement are those that can be written in the form
\[ \widehat{\mathscr{P}}_w(\varphi) = \exp\left( \int_{\mathbb{R}^d} f\big(\varphi(r)\big)\, dr \right). \quad (3.12) \]
To have $\widehat{\mathscr{P}}_w(0) = 1$ (normalization), we require that $f(0) = 0$. The requirement of positive-definiteness narrows the class of admissible functions $f$ much further, practically to those identified by the Lévy-Khinchine formula. This will be the subject of the greater part of the next chapter.
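The factorization over disjoint supports can be verified directly for any functional of the form (3.12). The sketch below is an illustration under assumptions: the domain discretization, the triangular test functions, and the Poisson-type exponent $f(t) = e^{jt} - 1$ (which satisfies $f(0) = 0$) are all choices made here, not taken from the book.

```python
import numpy as np

# Characteristic functional of the form (3.12), exp(integral of f(phi(r)) dr),
# discretized on a 1-D grid; f(t) = exp(j*t) - 1 is an assumed Levy-type exponent.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
f = lambda t: np.exp(1j * t) - 1.0
P_hat = lambda phi: np.exp(np.sum(f(phi)) * dx)

# Two test functions with disjoint supports (triangles centered at -3 and +3).
phi1 = np.maximum(0.0, 1.0 - np.abs(x + 3.0))
phi2 = np.maximum(0.0, 1.0 - np.abs(x - 3.0))

# Independence over disjoint supports <=> the functional factorizes.
lhs = P_hat(phi1 + phi2)
rhs = P_hat(phi1) * P_hat(phi2)
```

Because $f(0) = 0$, the integral over the sum splits exactly into the two disjoint contributions, so the factorization holds to machine precision; the same computation with overlapping supports would generally break it.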

3.5.6 Example: Filtered white Gaussian noise

In the above framework, we can define white Gaussian noise (or Gaussian innovation) on $\mathbb{R}^d$ as a random element of the space of Schwartz generalized functions $\mathcal{S}'(\mathbb{R}^d)$ whose characteristic functional is given by
\[ \widehat{\mathscr{P}}_w(\varphi) = e^{-\frac{1}{2}\|\varphi\|_2^2}. \]
Note that this functional is a special instance of (3.12) with $f(\xi) = -\frac{1}{2}\xi^2$. The Gaussian appellation is justified by observing that, for any $N$ test functions $\varphi_1,\dots,\varphi_N$, the random variables $\langle\varphi_1,w\rangle,\dots,\langle\varphi_N,w\rangle$ are jointly Gaussian. Indeed, we can apply Proposition 1 to obtain the joint characteristic function
\[ \widehat{p}_{\varphi_1:\varphi_N}(\xi) = \exp\left( -\frac{1}{2}\Big\|\sum_{i=1}^N \xi_i\varphi_i\Big\|_2^2 \right). \]
By taking the inverse Fourier transform of this expression, we find that the random variables $\langle\varphi_i,w\rangle$, $i = 1,\dots,N$, have a multivariate Gaussian distribution with zero mean and covariance matrix with entries
\[ C_{ij} = \langle\varphi_i,\varphi_j\rangle. \]
The independence of $\langle\varphi_1,w\rangle$ and $\langle\varphi_2,w\rangle$ is obvious whenever $\varphi_1$ and $\varphi_2$ have disjoint supports. This justifies calling the process white.⁹ In this special case, even mere orthogonality of $\varphi_1$ and $\varphi_2$ is enough for independence, since $\varphi_1 \perp \varphi_2$ implies $C_{ij} = 0$.

From formulas (3.9) and (3.10), we also find that $w$ has zero mean and "correlation function" $c_w(r,s) = \delta(r-s)$, which should also be familiar. In fact, this last expression is sometimes used to formally "define" white Gaussian noise.

A filtered white Gaussian noise is obtained by applying a continuous convolution (i.e., LSI) operator $\mathrm{U}^* : \mathcal{S}' \to \mathcal{S}'$ to the Gaussian innovation, in the sense described in Section 3.5.4.

Let us denote by $h$ the convolution kernel of the operator $\mathrm{U} : \mathcal{S} \to \mathcal{S}$ (the adjoint of $\mathrm{U}^*$).¹⁰ The convolution kernel of $\mathrm{U}^* : \mathcal{S}' \to \mathcal{S}'$ is then the reflected kernel $h^\vee$, with $h^\vee(r) = h(-r)$. Following Section 3.5.4, we find the characteristic functional of the filtered process $\mathrm{U}^*w = h^\vee * w$:
\[ \widehat{\mathscr{P}}_{\mathrm{U}^*w}(\varphi) = e^{-\frac{1}{2}\|h*\varphi\|_2^2}. \]

In turn, this yields the mean and correlation functions
\[ m_{\mathrm{U}^*w}(r) = 0, \qquad c_{\mathrm{U}^*w}(r,s) = \big(h * h^\vee\big)(r - s), \]
as expected.

9. Our notion of whiteness in this book goes further than having a white spectrum. By whiteness, we mean that the process is stationary and has truly independent (not merely uncorrelated) values over disjoint sets.

10. Recall that, for the convolution to map back into S , h needs to have a smooth Fourier transform, whichimplies rapid decay in the temporal or spatial domain. This is the case, in particular, for any rational transferfunction that lacks purely imaginary poles.
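The second-order statistics derived above can be checked by simulation. The following sketch is a discrete surrogate, not the book's construction: i.i.d. unit-variance Gaussian samples stand in for the innovation on a unit-spacing grid, and the FIR kernel $h$ is an arbitrary assumed choice. The empirical autocorrelation of the filtered sequence should then match the discrete counterpart of $(h * h^\vee)$.

```python
import numpy as np

rng = np.random.default_rng(0)

h = np.array([0.5, 1.0, 0.25])     # assumed FIR stand-in for the kernel of U
n_real, n_len = 20_000, 64
w = rng.standard_normal((n_real, n_len))       # discrete white Gaussian innovation
y = np.array([np.convolve(v, h) for v in w])   # filtered realizations

# Empirical covariance at lags 0..2 versus the discrete
# (h * h^v)[k] = sum_n h[n] h[n+k] (filtering with h or h^v gives the
# same second-order statistics, since the autocorrelation is symmetric).
m = 30
emp = np.array([np.mean(y[:, m] * y[:, m + k]) for k in range(3)])
theo = np.array([np.sum(h[:len(h) - k] * h[k:]) for k in range(3)])
```

With 20,000 realizations, the Monte-Carlo estimates of the lag-0, lag-1, and lag-2 correlations agree with the deterministic values (here 1.3125, 0.75, and 0.125) to within sampling error.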


Appendix B

Positive definiteness

Positive-definite functions play a central role in statistics, approximation theory [Mic86, Wen05], and machine learning [HSS08]. They allow for a convenient Fourier-domain specification of characteristic functions, autocorrelation functions, and interpolation/approximation kernels (e.g., radial basis functions), with the guarantee that the underlying approximation problems are well posed irrespective of the location of the data points. In this appendix, we provide the basic definitions of positive definiteness and conditional positive definiteness in the multidimensional setting, together with a review of the corresponding mathematical results. We distinguish between continuous functions on the one hand and generalized functions on the other. We also give a self-contained derivation of Gelfand and Vilenkin's characterization of conditionally positive-definite generalized functions in one dimension and discuss its connection with the celebrated Lévy-Khinchine formula of statisticians. For a historical account of the rich topic of positive definiteness, we refer to [Ste76].

B.1 Positive definiteness and Bochner’s Theorem

Definition 26. A continuous, complex-valued function $f$ of the vector variable $\omega \in \mathbb{R}^d$ is said to be positive semi-definite iff
\[ \sum_{m=1}^N \sum_{n=1}^N \xi_m \overline{\xi_n}\, f(\omega_m - \omega_n) \ge 0 \]
for every possible choice of $\omega_1,\dots,\omega_N \in \mathbb{R}^d$, $\xi_1,\dots,\xi_N \in \mathbb{C}$, and $N \in \mathbb{N}$. Such a function is called positive definite in the strict sense if the quadratic form is strictly positive for every $(\xi_1,\dots,\xi_N) \in \mathbb{C}^N \setminus \{0\}$.

In the sequel, we shall abbreviate "positive semi-definite" by positive definite. The property is equivalent to the requirement that the $N \times N$ matrix $\mathbf{F}$ whose elements are given by $[\mathbf{F}]_{m,n} = f(\omega_m - \omega_n)$ be positive semi-definite (or, equivalently, nonnegative definite) for all $N$, no matter how the $\omega_n$ are chosen.

The prototypical example of a positive-definite function is the Gaussian kernel $e^{-\omega^2/2}$. To establish the property, we express this Gaussian as the Fourier transform of $g(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$:
\begin{align*}
\sum_{m=1}^N \sum_{n=1}^N \xi_m \overline{\xi_n}\, e^{-\frac{(\omega_m-\omega_n)^2}{2}} &= \sum_{m=1}^N \sum_{n=1}^N \xi_m \overline{\xi_n} \int_{\mathbb{R}} e^{-j(\omega_m-\omega_n)x}\, g(x)\, dx \\
&= \int_{\mathbb{R}} \sum_{m=1}^N \sum_{n=1}^N \xi_m \overline{\xi_n}\, e^{-j(\omega_m-\omega_n)x}\, g(x)\, dx \\
&= \int_{\mathbb{R}} \underbrace{\bigg| \sum_{m=1}^N \xi_m e^{-j\omega_m x} \bigg|^2}_{\ge 0}\; \underbrace{g(x)}_{>0}\, dx \;\ge\; 0,
\end{align*}
where we made use of the fact that $g(x)$, the (inverse) Fourier transform of $e^{-\omega^2/2}$, is positive. It is not hard to see that the above argument remains valid for any (multidimensional) function $f(\omega)$ that is the Fourier transform of some nonnegative kernel $g(r) \ge 0$. The more impressive result is that the converse implication is also true.
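The matrix formulation invites a direct numerical check: for any set of frequencies $\omega_n$ (chosen at random below, an assumption made for illustration), the matrix built from the Gaussian kernel must have no negative eigenvalues, up to round-off.

```python
import numpy as np

rng = np.random.default_rng(1)

# F[m, n] = f(omega_m - omega_n) with f(w) = exp(-w^2/2), at random frequencies.
omega = rng.uniform(-5.0, 5.0, size=20)
F = np.exp(-0.5 * (omega[:, None] - omega[None, :])**2)
eigs = np.linalg.eigvalsh(F)   # real symmetric matrix => real spectrum
```

Repeating the experiment with other point sets only changes the spectrum, never its nonnegativity, which is exactly the content of Definition 26.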

Theorem 25 (Bochner's theorem). Let $f$ be a bounded continuous function on $\mathbb{R}^d$. Then, $f$ is positive definite if and only if it is the (conjugate) Fourier transform of a nonnegative and finite Borel measure $\mu$:
\[ f(\omega) = \int_{\mathbb{R}^d} e^{j\langle \omega, r\rangle}\, \mu(dr). \]
In particular, Bochner's theorem implies that $f$ is a valid characteristic function (that is, $f(\omega) = \mathbb{E}\{e^{j\langle\omega,x\rangle}\} = \int_{\mathbb{R}^d} e^{j\langle\omega,x\rangle}\, P_X(dx)$ for some probability measure $P_X$ on $\mathbb{R}^d$) if and only if $f$ is continuous and positive definite with $f(0) = 1$ (cf. Section 3.4.3 and Theorem 3).

Bochner’s theorem is also fundamental to the theory of scattered data interpolation, al-though it requires a very slight restriction on the Fourier transform of f to ensure positivedefiniteness in the strict sense [Wen05].

Theorem 26. A function f : Rd ! C that is the (inverse) Fourier transform of a non-negative, finite Borel measure µ is positive definite in the strict sense if there exists an openset E µRd such that µ(E) 6= 0.

In particular, the latter constraint is verified when f (r ) = F°1{g }(r ) where g 2 L1(Rd )is continuous, nonnegative and nonvanishing. This kind of result is highly relevant toapproximation and learning theory: Indeed, the choice a strictly positive-definite inter-polation kernel (or radial basis function) ensures that the solution of the generic scattereddata interpolation problem is well defined and unique, no matter how the data centers aredistributed [Mic86]. Here too, the prototypical example of a valid kernel in the Gaussian,which is (strictly) positive definite.
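To see strict positive definiteness at work, the sketch below solves a tiny Gaussian RBF interpolation problem at assumed, arbitrarily chosen centers and data: because the kernel matrix is strictly positive definite for distinct centers, the linear system has a unique solution and the interpolant matches the data exactly at the centers.

```python
import numpy as np

# Scattered centers and data (assumed, for illustration only).
centers = np.array([-2.7, -1.9, -1.2, -0.4, 0.3, 1.1, 2.0, 2.8])
data = np.sin(centers)

# Gaussian RBF kernel matrix: strictly positive definite for distinct centers,
# hence invertible regardless of how the centers are distributed.
K = np.exp(-0.5 * (centers[:, None] - centers[None, :])**2)
coeffs = np.linalg.solve(K, data)

# Interpolant s(x) = sum_n coeffs[n] * exp(-(x - centers[n])^2 / 2).
s = lambda x: np.exp(-0.5 * (x - centers)**2) @ coeffs
recon = np.array([s(c) for c in centers])
```

The guarantee here is structural: no matter how the centers are scattered (as long as they are distinct), the Gram matrix stays nonsingular, which is precisely what Theorem 26 delivers for the Gaussian kernel.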

There is also an extension of Bochner's theorem to generalized functions, due to Laurent Schwartz. In a nutshell, the idea is to replace each finite sum $\sum_{n=1}^N \xi_n f(\omega-\omega_n)$ by an infinite one (an integral), $\int_{\mathbb{R}^d} \varphi(\omega')\, f(\omega-\omega')\, d\omega' = \int_{\mathbb{R}^d} \varphi(\omega-\omega')\, f(\omega')\, d\omega' = \langle f, \varphi(\omega - \cdot)\rangle$, which amounts to considering appropriate linear functionals of $f$ over Schwartz' class of test functions $\mathcal{S}(\mathbb{R}^d)$. In doing so, the double sum of Definition 26 collapses into a scalar product between $f$ and the autocorrelation function of the test function $\varphi \in \mathcal{S}(\mathbb{R}^d)$,
\[ (\varphi * \overline{\varphi}^{\vee})(\omega) = \int_{\mathbb{R}^d} \varphi(\omega')\, \overline{\varphi(\omega'-\omega)}\, d\omega'. \]

Definition 27. A generalized function $f \in \mathcal{S}'(\mathbb{R}^d)$ is said to be positive definite if and only if, for all $\varphi \in \mathcal{S}(\mathbb{R}^d)$,
\[ \langle f,\, \varphi * \overline{\varphi}^{\vee} \rangle \ge 0. \]


It can be shown that this definition is equivalent to Definition 26 when $f(\omega)$ is continuous.

Theorem 27 (Schwartz-Bochner theorem). A generalized function $f \in \mathcal{S}'(\mathbb{R}^d)$ is positive definite if and only if it is the generalized Fourier transform of a nonnegative tempered measure $\mu$; that is,
\[ \langle \widehat{f}, \varphi\rangle = \langle f, \widehat{\varphi}\rangle = \int_{\mathbb{R}^d} \varphi(r)\, \mu(dr). \]
The term "tempered measure" refers to a generic type of mildly singular generalized function, defined by the requirement that the Lebesgue integral $\int_{\mathbb{R}^d} \varphi(r)\,\mu(dr)$ be finite for all $\varphi \in \mathcal{S}(\mathbb{R}^d)$. Such measures are allowed to exhibit polynomial growth at infinity, subject to the restriction that they remain finite on any compact set.

The fact that the above form implies positive definiteness can be verified by direct substitution and application of Parseval's relation, which yields
\[ \langle f,\, \varphi * \overline{\varphi}^{\vee}\rangle = \langle \widehat{f},\, |\widehat{\varphi}|^2\rangle = \int_{\mathbb{R}^d} |\widehat{\varphi}(x)|^2\, \mu(dx) \ge 0, \]
where the temperedness of $\mu$ ensures that the integral converges (since $|\widehat{\varphi}(x)|^2$ is rapidly decreasing).

The improvement over Theorem 25 is that $\mu(\mathbb{R}^d)$ is no longer constrained to be finite. While this extension is of no direct help for the specification of characteristic functions, it happens to be quite useful for the definition of spline-like interpolation kernels that result in well-posed data-fitting/approximation problems. We also note that the above definitions and results generalize to the infinite-dimensional setting (e.g., the Minlos-Bochner theorem, which involves measures over topological vector spaces).

B.2 Conditionally positive-definite functions

Definition 28. A continuous, complex-valued function $f$ of the vector variable $\omega \in \mathbb{R}^d$ is said to be conditionally positive-definite of (integer) order $k \ge 0$ iff
\[ \sum_{m=1}^N \sum_{n=1}^N \xi_m \overline{\xi_n}\, f(\omega_m - \omega_n) \ge 0 \]
under the condition
\[ \sum_{n=1}^N \xi_n\, p(\omega_n) = 0 \quad \text{for all } p \in \Pi_{k-1}(\mathbb{R}^d), \]
for all possible choices of $\omega_1,\dots,\omega_N \in \mathbb{R}^d$, $\xi_1,\dots,\xi_N \in \mathbb{C}$, and $N \in \mathbb{N}$, where $\Pi_{k-1}(\mathbb{R}^d)$ denotes the space of multidimensional polynomials of degree $(k-1)$.

This definition can also be extended to generalized functions using the line of thought that leads to Definition 27. To keep the presentation reasonably simple and to make the link with the definition of Lévy exponents in Section 4.2, we now focus on the one-dimensional case ($d = 1$). Specifically, we consider the polynomial constraint $\sum_{n=1}^N \xi_n\, \omega_n^m = 0$, $m \in \{0,\dots,k-1\}$, and derive the generic form of conditionally positive-definite generalized functions of order $k$, including the continuous ones, which are of greatest interest to us.

The distributional counterpart of the $k$th-order constraint for $d = 1$ is the orthogonality condition $\int_{\mathbb{R}} \varphi(\omega)\, \omega^m\, d\omega = 0$ for $m \in \{0,\dots,k-1\}$. It is enforced by restricting the analysis to the class of test functions whose moments up to order $(k-1)$ vanish. Without loss of generality, this is equivalent to considering an alternative test function $\mathrm{D}^k\varphi = \varphi^{(k)}$, where $\mathrm{D}^k$ is the $k$th derivative operator.

Definition 29. A generalized function $f \in \mathcal{S}'(\mathbb{R})$ is said to be conditionally positive-definite of order $k$ iff, for all $\varphi \in \mathcal{S}(\mathbb{R})$,
\[ \big\langle f,\; \varphi^{(k)} * \overline{\varphi^{(k)}}^{\vee} \big\rangle = \big\langle f,\; (-1)^k \mathrm{D}^{2k}\big(\varphi * \overline{\varphi}^{\vee}\big) \big\rangle \ge 0. \]

This extended definition allows for the derivation of the corresponding version of Bochner's theorem, which provides an explicit characterization of the family of conditionally positive-definite generalized functions, together with their generalized Fourier transforms.

Theorem 28 (Gelfand-Vilenkin). A generalized function $f \in \mathcal{S}'(\mathbb{R})$ is conditionally positive-definite of order $k$ if and only if it admits the following representation over $\mathcal{S}(\mathbb{R})$:
\[ \langle \widehat{f}, \varphi\rangle = \langle f, \widehat{\varphi}\rangle = \int_{\mathbb{R}\setminus\{0\}} \left( \varphi(x) - r(x)\sum_{n=0}^{2k-1}\frac{\varphi^{(n)}(0)}{n!}\, x^n \right) \mu(dx) + \sum_{n=0}^{2k} a_n\, \frac{\varphi^{(n)}(0)}{n!}, \quad (\text{B.1}) \]
where $\mu$ is a positive tempered Borel measure on $\mathbb{R}\setminus\{0\}$ satisfying
\[ \int_{|x|<1} |x|^{2k}\, \mu(dx) < \infty. \]
Here, $r(x)$ is a function in $\mathcal{S}(\mathbb{R})$ such that $(r(x)-1)$ has a zero of order $(2k+1)$ at $x = 0$, while the $a_n$ are appropriate real-valued constants with the constraint that $a_{2k} \ge 0$.

Below, we provide a slightly adapted version of Gelfand and Vilenkin's proof, which is remarkably concise and quite illuminating [GV64, Theorem 1, pp. 178], at least if one compares it with the standard derivation of the Lévy-Khinchine formula, which has a much more technical flavor (cf. [Sat94]) and is ultimately less general.

Proof. Since $\langle f,\, (-1)^k \mathrm{D}^{2k}(\varphi * \overline{\varphi}^{\vee})\rangle = \langle (-1)^k \mathrm{D}^{2k} f,\, \varphi * \overline{\varphi}^{\vee}\rangle$, we interpret Definition 29 as the property that $(-1)^k \mathrm{D}^{2k} f$ is positive definite. By the Schwartz-Bochner theorem, this is equivalent to the existence of a tempered measure $\nu$ such that
\[ \big\langle (-1)^k \mathrm{D}^{2k} f,\, \widehat{\varphi}\big\rangle = \big\langle f,\, (-1)^k \mathrm{D}^{2k}\widehat{\varphi}\big\rangle = \big\langle \widehat{f},\, x^{2k}\varphi\big\rangle = \int_{\mathbb{R}} \varphi(x)\, \nu(dx). \]
By defining $\psi(x) = x^{2k}\varphi(x)$, this can be rewritten as
\[ \langle \widehat{f}, \psi\rangle = \int_{\mathbb{R}} \frac{\psi(x)}{x^{2k}}\, \nu(dx), \]
where $\psi$ is a test function that has a zero of order $2k$ at the origin. In particular, this implies that $\lim_{\epsilon \downarrow 0} \int_{|x|<\epsilon} \frac{\psi(x)}{x^{2k}}\, \nu(dx) = \frac{\psi^{(2k)}(0)}{(2k)!}\, a_{2k}$, where $a_{2k} \ge 0$ is the $\nu$-measure of the point $x = 0$. Introducing the new measure $\mu(dx) = \nu(dx)/x^{2k}$ on $\mathbb{R}\setminus\{0\}$, we then decompose the Lebesgue integral as
\[ \langle \widehat{f}, \psi\rangle = \int_{\mathbb{R}\setminus\{0\}} \psi(x)\, \mu(dx) + a_{2k}\, \frac{\psi^{(2k)}(0)}{(2k)!}, \quad (\text{B.2}) \]
which specifies $\widehat{f}$ on the subset of test functions that have a $2k$th-order zero at the origin. To extend the representation to the whole space $\mathcal{S}(\mathbb{R})$, we associate with every $\varphi \in \mathcal{S}(\mathbb{R})$ the corrected function
\[ \psi_c(x) = \varphi(x) - r(x)\sum_{n=0}^{2k-1}\frac{\varphi^{(n)}(0)}{n!}\, x^n, \quad (\text{B.3}) \]
with $r(x)$ as specified in the statement of the theorem. By construction, $\psi_c \in \mathcal{S}(\mathbb{R})$ and it has the $2k$th-order zero that is required for (B.2) to be applicable. By combining (B.2) and (B.3), we find that
\[ \langle \widehat{f}, \varphi\rangle = \int_{\mathbb{R}\setminus\{0\}} \psi_c(x)\, \mu(dx) + a_{2k}\, \frac{\psi_c^{(2k)}(0)}{(2k)!} + \sum_{n=0}^{2k-1}\frac{\varphi^{(n)}(0)}{n!}\, \big\langle \widehat{f},\, r(x)\,x^n\big\rangle. \]
Next, we identify the constants $a_n = \langle \widehat{f}, r(x)\,x^n\rangle$ and note that $\psi_c^{(2k)}(0) = \varphi^{(2k)}(0)$. The final step is to substitute these, together with the expression (B.3) of $\psi_c$, into the above formula, which yields the desired result.

To prove the sufficiency of the representation, we apply (B.1) to evaluate the functional
\[ \big\langle f,\; \varphi^{(k)} * \overline{\varphi^{(k)}}^{\vee}\big\rangle = \big\langle \widehat{f},\; x^{2k}|\widehat{\varphi}(x)|^2\big\rangle = \int_{\mathbb{R}\setminus\{0\}} x^{2k}|\widehat{\varphi}(x)|^2\, \mu(dx) + a_{2k}\, |\widehat{\varphi}(0)|^2 \ge 0, \]
where we have used the property that the derivatives of $x^{2k}|\widehat{\varphi}(x)|^2$ all vanish at the origin, except the one of order $2k$, which equals $(2k)!\,|\widehat{\varphi}(0)|^2$ at $x = 0$.

It is important to note that the choice of the function $r$ is arbitrary as long as it fulfills the boundary condition $r(x) = 1 + O(|x|^{2k+1})$ as $x \to 0$, so as to regularize the potential $2k$th-order singularity of $\mu$ at the origin, and as long as it decays sufficiently fast to temper the Taylor-series correction in (B.3) at infinity. If we compare the effect of using two different tempering functions $r_1$ and $r_2$, the modification appears only in the value of the constants $a_n$, with $a_{n,2} - a_{n,1} = \langle \widehat{f},\, (r_2(x) - r_1(x))\,x^n\rangle$. Another way of putting it is that the corresponding distributions $\widehat{f}_1$ and $\widehat{f}_2$ specified by the leading integral in (B.1) will only differ by a $(2k-1)$th-order point distribution that is entirely localized at $x = 0$; that is, $\widehat{f}_2(x) - \widehat{f}_1(x) = \sum_{n=0}^{2k-1} \frac{a_{n,2}-a_{n,1}}{n!}\,\delta^{(n)}(x)$, owing to the property that $a_{2k}$ is common to both scenarios or, equivalently, that the difference of their inverse Fourier transforms $f_1$ and $f_2$ is a polynomial of degree $(2k-1)$.

Thanks to Theorem 28, it is also possible to derive an integral representation that is the $k$th-order generalization of the Lévy-Khinchine formula. For a detailed treatment of the multidimensional version of the problem, we refer to the works of Madych, Nelson, and Sun [MN90a, Sun93].

Corollary 5. Let $f(\omega)$ be a continuous function of $\omega \in \mathbb{R}$. Then, $f$ is conditionally positive-definite of order $k$ if and only if it can be represented as
\[ f(\omega) = \frac{1}{2\pi} \int_{\mathbb{R}\setminus\{0\}} \left( e^{j\omega x} - r(x)\sum_{n=0}^{2k-1}\frac{(j\omega x)^n}{n!} \right) \mu(dx) + \sum_{n=0}^{2k} a_n\, \frac{(j\omega)^n}{n!}, \]
where $\mu$ is a positive Borel measure on $\mathbb{R}\setminus\{0\}$ satisfying
\[ \int_{\mathbb{R}} \min\big(|x|^{2k}, 1\big)\, \mu(dx) < \infty, \]
and where $r(x)$ and the $a_n$ are as in Theorem 28.

The result is obtained by plugging $\varphi(x) = \frac{1}{2\pi}\, e^{j\omega x}$ (whose Fourier transform is $\widehat{\varphi} = \delta(\cdot - \omega)$) into (B.1), which is justifiable by a continuity argument. The key point is that the corresponding integral is bounded when $\mu$ satisfies the admissibility condition, which ensures the continuity of $f(\omega)$ (by Lebesgue's dominated-convergence theorem), and vice versa.


B.3 The Lévy-Khinchine formula from the point of view of generalized functions

We now make the link with the Lévy-Khinchine theorem of statisticians (cf. Section 4.2.1), which is equivalent to characterizing the functions that are conditionally positive-definite of order one. To that end, we rewrite the formula of Corollary 5 for $k = 1$, under the additional constraint that $f_1(0) = 0$ (which fixes the value of $a_0$), as
\begin{align*}
f_1(\omega) &= a_0 + a_1 j\omega - \frac{a_2}{2}\,\omega^2 + \frac{1}{2\pi}\int_{\mathbb{R}\setminus\{0\}} \big( e^{j\omega x} - r(x) - r(x)\, j\omega x \big)\, \mu(dx) \\
&= a_1 j\omega - \frac{a_2}{2}\,\omega^2 + \int_{\mathbb{R}\setminus\{0\}} \big( e^{j\omega x} - 1 - r(x)\, j\omega x \big)\, v(x)\, dx,
\end{align*}
where $v(x)\, dx = \frac{1}{2\pi}\,\mu(dx)$, $r(x) = 1 + O(|x|^3)$ as $x \to 0$, and $\lim_{x\to\pm\infty} r(x) = 0$. Clearly, the new form is equivalent to the Lévy-Khinchine formula (4.3), with the slight difference that the bias compensation is achieved by using a bell-shaped, infinitely differentiable function $r$ instead of the rectangular window $\mathbb{1}_{|x|<1}(x)$.

Likewise, we are able to transcribe the generalized Fourier-transform-pair relation (B.1) for the Lévy-Khinchine representation (4.3), which yields
\[ \langle \widehat{f}_{\mathrm{L\text{-}K}}, \varphi\rangle = \langle f_{\mathrm{L\text{-}K}}, \widehat{\varphi}\rangle = \int_{\mathbb{R}\setminus\{0\}} \big( \varphi(x) - \varphi(0) - x\,\mathbb{1}_{|x|<1}(x)\,\varphi^{(1)}(0) \big)\, v(x)\, dx + b_1'\,\varphi^{(1)}(0) + \frac{b_2}{2}\,\varphi^{(2)}(0). \quad (\text{B.4}) \]
The interest of (B.4) is that it uniquely specifies the generalized Fourier transform of a Lévy exponent $f_{\mathrm{L\text{-}K}}$ as a linear functional of $\varphi$. We can also give a "time-domain" (or pointwise) interpretation of this result by distinguishing between three cases.

1) Integrable Lévy density $v \in L_1(\mathbb{R})$

Here, we are able to split the leading integral in (B.4) into its subparts, which results in
\[ \widehat{f}_{\mathrm{L\text{-}K}}(x) = v(x) - \delta(x)\left( \int_{\mathbb{R}} v(a)\, da \right) + \delta'(x)\left( b_1' - \int_{|a|<1} a\, v(a)\, da \right) + \delta''(x)\,\frac{b_2}{2}. \]
The underlying principle is that the so-defined generalized function results in the same measurements as (B.4) when it is applied to the test function $\varphi$. In particular, the values of $\varphi^{(n)}$ at the origin are sampled using the Dirac distribution and its derivatives.

2) Non-integrable Lévy density ($v \notin L_1(\mathbb{R})$) with finite absolute moment $\int_{\mathbb{R}} |a|\, v(a)\, da < \infty$

To ensure that the integral in (B.4) remains convergent, we need to retain the zeroth-order correction. Yet, we can still pull out the third term, which results in the interpretation
\[ \widehat{f}_{\mathrm{L\text{-}K}}(x) = \mathrm{F.P.}\{v\}(x) + \delta'(x)\left( b_1' - \int_{|a|<1} a\, v(a)\, da \right) + \delta''(x)\,\frac{b_2}{2}, \]
where F.P. stands for the finite-part operator, which implicitly implements the Taylor-series adjustment that stabilizes the scalar-product integral $\langle v, \varphi\rangle$.


3) Non-integrable Lévy density with unbounded absolute moment $\int_{\mathbb{R}} |a|\, v(a)\, da = \infty$

Here, we cannot split the integral anymore. In the particular case where $\int_{|a|>1} |a|\, v(a)\, da < \infty$, we can stabilize the integral by applying a full first-order Taylor-series correction. This leads to the finite-part interpretation
\[ \widehat{f}_{\mathrm{L\text{-}K}}(x) = \mathrm{F.P.}\{v\}(x) + b_1\,\delta'(x) + \frac{b_2}{2}\,\delta''(x), \]
which is the direct counterpart of (4.5). For $\int_{|a|>1} |a|\, v(a)\, da = \infty$, the proper pointwise interpretation becomes more delicate, and it is safer to stick to the distributional definition (B.4).

The relevance of these results is that they properly characterize the impulse response of the infinitesimal semigroup generator $\mathrm{G}$ investigated in Section 9.7. Indeed, we have $g(x) = \mathrm{G}\{\delta\}(x) = \mathcal{F}\{f\}(x)$, which is the generalized Fourier transform of the Lévy exponent $f$.


Appendix C

Special functions and asymptotics

C.1 Modified Bessel functions

The modified Bessel function of the second kind with order parameter $\alpha \in \mathbb{R}$ admits the Fourier-based representation (NIST Handbook)
\[ K_\alpha(\omega) = \int_{\mathbb{R}} \frac{e^{-j\omega x}}{(1+x^2)^{|\alpha|}}\, dx. \]
It has the symmetry property $K_\alpha(x) = K_{-\alpha}(x)$. A special case of interest is
\[ K_{1/2}(x) = \Big(\frac{\pi}{2x}\Big)^{\frac{1}{2}} e^{-x}. \]
The small-argument behavior of $K_\alpha$ is $K_\alpha(x) \sim \frac{\Gamma(\alpha)}{2}\big(\frac{2}{x}\big)^{\alpha}$ as $x \to 0$. In order to determine the form of the variance-gamma distribution around the origin, we can rely on the following expansion, which includes a few more terms:
\[ K_\alpha(x) = x^{-\alpha}\left( 2^{\alpha-1}\Gamma(\alpha) - \frac{2^{\alpha-3}\Gamma(\alpha)\,x^2}{\alpha-1} + O\big(x^4\big) \right) + x^{\alpha}\left( 2^{-\alpha-1}\Gamma(-\alpha) + \frac{2^{-\alpha-3}\Gamma(-\alpha)\,x^2}{\alpha+1} + O\big(x^4\big) \right). \]
At the other end of the scale, its asymptotic behavior is
\[ K_\alpha(x) \sim \sqrt{\frac{\pi}{2x}}\, e^{-x} \quad \text{as } x \to +\infty. \]
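The closed form for $K_{1/2}$ can be spot-checked numerically. The sketch below does not use the Fourier-based representation above; instead it evaluates the standard integral representation $K_\alpha(x) = \int_0^\infty e^{-x\cosh t}\cosh(\alpha t)\, dt$ (DLMF 10.32.9) by a plain Riemann sum, an assumed quadrature choice made here for illustration.

```python
import numpy as np

# Modified Bessel K via the integral representation
# K_a(x) = integral_0^inf exp(-x*cosh(t)) * cosh(a*t) dt   (DLMF 10.32.9),
# evaluated with a plain Riemann sum on a truncated grid.
t = np.linspace(0.0, 20.0, 200001)
dt = t[1] - t[0]

def K(a, x):
    return np.sum(np.exp(-x * np.cosh(t)) * np.cosh(a * t)) * dt

# Compare with the closed form K_{1/2}(x) = sqrt(pi/(2x)) * exp(-x).
xs = [0.5, 1.0, 2.0, 5.0]
closed = [np.sqrt(np.pi / (2 * x)) * np.exp(-x) for x in xs]
numeric = [K(0.5, x) for x in xs]
```

The same routine can be used to check the symmetry $K_\alpha = K_{-\alpha}$ (immediate here, since $\cosh$ is even) and the large-argument asymptotics quoted above.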

C.2 Gamma function

Euler’s gamma function constitutes an analytical extension of the factorial function n! =°(n +1). It is defined by the integral

°(z) =Z+1

0t z°1e°t dt ,

which is convergent for Re(z) > 0. The definition can be extended to the complex planeby analytical continuation. The gamma function also admits the well-known product de-composition

°(z) = e°∞0z

z

1Y

n=1

1+ zn

¥°1ez/n

201

An introduction to Sparse Stochastic Processes Copyright © M. Unser and P.D. Tafti

Page 56: An introduction to sparse stochastic processeslototsky/MATH606/BookCh1-3.pdf · An introduction to Sparse Stochastic Processes Copyright © M. Unser and P.D. Tafti An introduction

An introduction to Sparse Stochastic Processes Copyright © M. Unser and P.D. Tafti

C. SPECIAL FUNCTIONS AND ASYMPTOTICS

where ∞0 is the Euler-Mascheroni constant. The above allows us to derive the expansion

° log |°(z)|2 = 2∞0Re(z)+ log |z|2 +1X

n=1

µ

logØ

Ø

Ø

1+ zn

Ø

Ø

Ø

2°2

Re(z)n

,

which is directly applicable to the likelihood function associated with the Meixner distri-bution. The poly-gamma function of order m is defined as

√(m)(z) = dm+1

dzm+1 log°(z).

Also relevant to that context is the integral relation

Z

R

Ø

Ø

Ø

°(r2+ jx)

Ø

Ø

Ø

2ejzx dx = 2º°(r )

µ

12cosh z

2

∂r

for r > 0 and z 2C, which can be interpreted as a Fourier transform by setting z =°j!.
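Both the factorial identity and the product decomposition lend themselves to a quick numerical check. The sketch below is an illustration with an assumed truncation of the infinite product (summed in the log domain for stability); it is not intended as an efficient way to compute $\Gamma$.

```python
import numpy as np
from math import gamma, factorial

# Factorial extension: Gamma(n+1) = n! for integer n.
facts_ok = all(np.isclose(gamma(n + 1), factorial(n)) for n in range(1, 10))

# Truncated product decomposition (assumed cutoff N, log-domain summation):
# Gamma(z) ~ exp(-gamma0*z)/z * prod_{n<=N} (1+z/n)^{-1} exp(z/n).
def gamma_product(z, N=1_000_000):
    n = np.arange(1, N + 1)
    log_terms = z / n - np.log1p(z / n)
    return np.exp(-np.euler_gamma * z + np.sum(log_terms)) / z

approx = gamma_product(1.5)
exact = gamma(1.5)          # = sqrt(pi)/2
```

The truncation error of the partial product behaves like $z^2/(2N)$, so the cutoff above leaves roughly six correct digits for moderate $z$.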

C.3 Symmetric-alpha-stable distributions

The S$\alpha$S pdf of degree $\alpha \in (0,2]$ and scale parameter $s_0$ is best defined via its characteristic function:
\[ p(x;\alpha,s_0) = \int_{\mathbb{R}} e^{-|s_0\omega|^{\alpha}}\, e^{j\omega x}\, \frac{d\omega}{2\pi}. \]
Alpha-stable distributions do not admit closed-form expressions, except in the special cases $\alpha = 1$ (Cauchy distribution) and $\alpha = 2$ (Gaussian distribution). Moreover, their absolute moments of order $p$, $\mathbb{E}\{|X|^p\}$, are unbounded for $p > \alpha$, which is characteristic of heavy-tailed distributions. We can relate the (symmetric) $\gamma$th-order moments of their characteristic function to the gamma function by performing the change of variable $t = (s_0\omega)^{\alpha}$, which leads to
\[ \int_{\mathbb{R}} |\omega|^{\gamma}\, e^{-|s_0\omega|^{\alpha}}\, d\omega = 2\int_0^{\infty} \frac{s_0^{-\gamma-1}}{\alpha}\, t^{\frac{\gamma+1}{\alpha}-1}\, e^{-t}\, dt = \frac{2\, s_0^{-\gamma-1}\,\Gamma\big(\frac{\gamma+1}{\alpha}\big)}{\alpha}. \quad (\text{C.1}) \]
By using the correspondence between Fourier-domain moments and time-domain derivatives, we can use this result to write the Taylor series of $p(x;\alpha,s_0)$ around $x = 0$ as
\[ p(x;\alpha,s_0) = \sum_{k=0}^{\infty} \frac{s_0^{-2k-1}}{\pi\alpha}\, \Gamma\Big(\frac{2k+1}{\alpha}\Big)\, \frac{(-1)^k\, |x|^{2k}}{(2k)!}, \quad (\text{C.2}) \]
which involves even terms only (because of symmetry). The moment formula (C.1) also yields a simple expression for the slope of the score at the origin, which is given by
\[ \Phi_X''(0) = -\frac{p_X''(0)}{p_X(0)} = \frac{\Gamma\big(\frac{3}{\alpha}\big)}{s_0^2\,\Gamma\big(\frac{1}{\alpha}\big)}. \]
Similar techniques are applicable to obtain the asymptotic form of $p(x;\alpha,s_0)$ as $x$ tends to infinity [Ber52, TN95]. To characterize the tail behavior, it is sufficient to consider the first term of the asymptotic expansion
\[ p(x;\alpha,s_0) \sim \frac{1}{\pi}\,\Gamma(\alpha+1)\,\sin\Big(\frac{\pi\alpha}{2}\Big)\, \frac{s_0^{\alpha}}{|x|^{\alpha+1}} \quad \text{as } x \to \pm\infty, \quad (\text{C.3}) \]
which emphasizes the algebraic decay of order $(\alpha+1)$ at infinity.
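The defining Fourier inversion can be carried out numerically and checked against the one closed-form case used below, $\alpha = 1$, where the S$\alpha$S density reduces to the Cauchy density $s_0/(\pi(s_0^2 + x^2))$. The grid and truncation choices in the sketch are assumptions.

```python
import numpy as np

# S-alpha-S density by direct numerical inversion of exp(-|s0*w|^alpha);
# by symmetry, only the cosine part of e^{jwx} contributes.
s0, alpha = 1.5, 1.0
w = np.linspace(-200.0, 200.0, 400001)
dw = w[1] - w[0]

def p_sas(x):
    return np.sum(np.exp(-np.abs(s0 * w)**alpha) * np.cos(w * x)) * dw / (2 * np.pi)

xs = np.array([0.0, 1.0, 3.0])
p_num = np.array([p_sas(x) for x in xs])
p_cauchy = s0 / (np.pi * (s0**2 + xs**2))
```

Setting $\alpha$ to other values in $(0,2]$ gives the general (non-closed-form) S$\alpha$S densities by the same inversion; only the $\alpha = 1$ and $\alpha = 2$ cases admit the closed-form comparison.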
