
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 42, NO. 1, JANUARY 1996

Simulation of Random Processes and Rate-Distortion Theory

Yossef Steinberg and Sergio Verdú, Fellow, IEEE

Abstract: We study the randomness necessary for the simulation of a random process with given distributions, in terms of the finite-precision resolvability of the process. Finite-precision resolvability is defined as the minimal random-bit rate required by the simulator as a function of the accuracy with which the distributions are replicated. The accuracy is quantified by means of various measures: variational distance, divergence, Ornstein, Prohorov, and related measures of distance between the distributions of random processes. In the case of Ornstein, Prohorov, and other distances of the Kantorovich-Vasershtein type, we show that the finite-precision resolvability is equal to the rate-distortion function with a fidelity criterion derived from the accuracy measure. This connection leads to new results on nonstationary rate-distortion theory. In the case of variational distance, the resolvability of stationary ergodic processes is shown to equal the entropy rate regardless of the allowed accuracy. In the case of normalized divergence, explicit expressions for finite-precision resolvability are obtained in many cases of interest, and connections with data compression with minimum probability of block error are shown.

Index Terms: Shannon theory, rate-distortion theory, data compression, resolvability, simulation complexity, variational distance, divergence, Ornstein distance, Prohorov distance.

I. INTRODUCTION AND SUMMARY OF RESULTS

A. Finite-Precision Resolvability

THE artificial generation of random processes with prescribed distributions arises in problems such as speech synthesis, texture generation, noise simulation, etc. Any algorithm used to generate a random process can be viewed as a deterministic mapping of a source of purely random (independent equally likely) bits into sample paths. Han and Verdú [1] posed the problem of finding the resolvability of a random process, defined as the minimal number of random bits required per generated sample so that the finite-dimensional distributions of the generated process converge to those of the desired random process. It is shown in [1] that (if convergence is defined in the sense of vanishing variational distance) resolvability is equal to the sup-entropy rate, a quantity which is equal to the conventional entropy rate in the special case of stationary ergodic processes. Moreover, it is shown in [1] that resolvability is equal to the minimum achievable fixed-length source coding rate for any finite-alphabet process.

Manuscript received December 23, 1993; revised August 16, 1995.
Y. Steinberg is with the Department of Electrical Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel.
S. Verdú is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA.
Publisher Item Identifier S 0018-9448(96)00032-6.

The problem studied in this paper is suggested by the results cited above: find the minimal randomness necessary to generate a process with a given nonvanishing bound on the approximation error tolerated between the desired and the generated finite-dimensional distributions. This fundamental limit, which we refer to as the finite-precision resolvability, characterizes the degree to which the distributions of the process can be derandomized while distorting them by no more than a given bound. Finite-precision resolvability is clearly of interest in the simulation of continuous-alphabet random sources. Moreover, even in cases where arbitrarily good approximations are feasible, one may wish to reduce the simulation complexity below the sup-entropy rate at the expense of lower accuracy.

Not surprisingly, the finite-precision resolvability depends on the way the approximation error is defined. Particularizing the results of this paper, in Fig. 1 we show the finite-precision resolvability function for a source of purely random bits according to four different measures of similarity between the n-dimensional generated distribution $Q^n$ and desired distribution $P^n$:

Variational Distance:

$d_v(Q^n, P^n) = \sum_{a^n \in \{0,1\}^n} |P^n(a^n) - Q^n(a^n)| = 2 \max_{A \subseteq \{0,1\}^n} (P^n(A) - Q^n(A)).$

Normalized Divergence:

$\frac{1}{n} D(P^n \| Q^n) = \frac{1}{n} \sum_{a^n \in \{0,1\}^n} P^n(a^n) \log \frac{P^n(a^n)}{Q^n(a^n)}.$

Ornstein's $\bar d_n$ Distance: the smallest expected fraction of discrepancies between two n-tuples generated with distributions $Q^n$ and $P^n$, respectively.

Prohorov Distance: the smallest $D \ge 0$ such that two binary n-tuples can be given distributions $Q^n$ and $P^n$, respectively, in such a way that the probability of the event {fraction of discrepancies between the two n-tuples is greater than D} does not exceed D.
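To make the four measures concrete, here is a small numerical sketch (my own illustration, not part of the paper): it evaluates the variational distance, the normalized divergence, and the Ornstein distance for toy binary distributions at blocklength n = 2, computing the Ornstein distance as a transportation linear program over couplings. All function names and parameter choices are mine.

```python
# Illustrative sketch (assumptions mine) of the accuracy measures of Section I
# for small blocklength n over the binary alphabet.
import itertools
import numpy as np
from scipy.optimize import linprog

def variational(P, Q):
    # d_v(P, Q) = sum_a |P(a) - Q(a)|
    return np.abs(P - Q).sum()

def normalized_divergence(P, Q, n):
    # (1/n) D(P || Q) in bits, with the convention 0 log 0 = 0
    mask = P > 0
    if np.any(Q[mask] == 0):
        return np.inf
    return (P[mask] * np.log2(P[mask] / Q[mask])).sum() / n

def ornstein(P, Q, n):
    # bar{d}_n: minimize E[(1/n) Hamming(a^n, b^n)] over all couplings of P, Q
    tuples = list(itertools.product([0, 1], repeat=n))
    m = len(tuples)
    cost = np.array([[sum(x != y for x, y in zip(a, b)) / n
                      for b in tuples] for a in tuples]).ravel()
    # Equality constraints: row sums = P, column sums = Q
    A = np.zeros((2 * m, m * m))
    for i in range(m):
        A[i, i * m:(i + 1) * m] = 1   # row marginal
        A[m + i, i::m] = 1            # column marginal
    b = np.concatenate([P, Q])
    res = linprog(cost, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
    return res.fun

n = 2
tuples = list(itertools.product([0, 1], repeat=n))
P = np.full(len(tuples), 1 / len(tuples))                     # fair-coin pairs
Q = np.array([0.6 ** (2 - sum(t)) * 0.4 ** sum(t) for t in tuples])  # Bernoulli(0.4)
print(variational(P, Q), normalized_divergence(P, Q, n), ornstein(P, Q, n))
```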

B. Resolvability and Rate-Distortion Theory

The finite-precision resolvability of the Bernoulli-1/2 source with the Ornstein (and Prohorov) distance is equal to 1 bit minus the binary entropy function of D if D < 1/2, and 0 otherwise (Fig. 1).

(General definitions of the approximation measures, for not necessarily binary processes, are found in Section II.)



Fig. 1. Finite-precision resolvability of a source of purely random bits with respect to variational distance (upper), normalized divergence (middle), and Ornstein and Prohorov distance (lower). (Axes: resolvability versus precision.)

Note that this is the rate-distortion function R(D) of the Bernoulli-1/2 source with the Hamming distortion metric. As we will see, this coincidence is far from accidental. Take a data-compression code that comes close to achieving R(D) and whose codewords are equiprobable. When those codewords are input to their corresponding decoder, the resulting binary process differs from the original Bernoulli-1/2 process in a fraction of (roughly) D symbols. Therefore, the output of the cascade of data-compression encoder and decoder is indeed an approximating process whose Ornstein distance from the Bernoulli-1/2 process is equal to D. Unfortunately, the Bernoulli-1/2 process itself is the input to the cascade, so if we were to use this scheme as the random number generator we would need a randomness rate equal to 1 bit. However, recall that the $2^{nR(D)+n\gamma}$ codewords are equiprobable. Thus they can be substituted by a source of purely random bits operating at rate $R(D) + \gamma$. Thus the finite-precision resolvability cannot be larger than R(D). Conversely, the entropy of the generated process is greater than or equal to the input-output mutual information of any binary channel whose input is Bernoulli-1/2 and whose bit-error rate is D. According to rate-distortion theory [17], the minimum of such mutual informations is well known to equal R(D).
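The following short numerical sketch (my own illustration, not from the paper) evaluates the three Fig. 1 curves for the fair-coin source, using the closed forms discussed here and in Sections I-A and I-C:

```python
# Sketch (assumptions mine): the three resolvability curves of Fig. 1 for the
# fair-coin source, evaluated at a few precisions D.
import numpy as np

def h2(d):
    # binary entropy in bits; inputs are clipped to avoid log(0),
    # so h2(0) and h2(1) evaluate to approximately 0
    d = np.clip(d, 1e-12, 1 - 1e-12)
    return -d * np.log2(d) - (1 - d) * np.log2(1 - d)

for D in (0.0, 0.05, 0.1, 0.25, 0.5):
    ornstein_prohorov = max(1 - h2(D), 0.0) if D < 0.5 else 0.0
    divergence = max(1 - D, 0.0)          # S_d(D, X) = 1 - D for this source
    variational = 1.0 if D < 2 else 0.0   # equals the entropy rate for D < 2
    print(f"D={D:.2f}  R(D)={ornstein_prohorov:.3f}  "
          f"1-D={divergence:.3f}  var={variational}")
```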

As hinted by this simple example, there is a strong connection between finite-precision simulation of random processes and rate-distortion theory, in the same way that [1] found a strong connection between noiseless source coding and arbitrarily accurate approximation of source statistics. This connection exists notwithstanding the fact that the purpose of source coding with a fidelity criterion is to approximate sample paths while getting rid of as much randomness as possible, whereas the purpose of resolvability is to approximate distributions while generating as little randomness as possible. The link between both theories arises when the distance between distributions is of the Kantorovich-Vasershtein type [10], [11]. Such a distance between distributions is defined, for any given metric between pairs of sample paths, by taking the infimum of the expectation of the metric over all joint distributions whose


marginals are equal to the desired and generated distributions. It can be seen that the Ornstein distance cited above is of the Kantorovich-Vasershtein type. Instead of the expectation, an alternative measure of the size of a positive random variable is the smallest $\epsilon > 0$ such that the probability that the random variable exceeds $\epsilon$ is less than $\epsilon$. This leads to the aforementioned Prohorov distance measure.

A main result of this paper (Section III) is that the finite-precision resolvability defined with a Kantorovich-Vasershtein distance for a given metric is equal to the rate-distortion function defined with that metric. (In the previous example of a Bernoulli-1/2 source, the metric used was the Hamming distance.) Our approach to show this result is as follows. For the achievability part, the problem is more complicated than might be surmised from the justification of the result in the above example. The reason is that, in general, the codewords generated by the rate-distortion encoder are not equiprobable, and thus they cannot be substituted directly by a source of purely random bits. The way [1] circumvents this issue is to define equiprobable distributions on collections that include repetition of elements. For example, unlike a truly equiprobable distribution, this enables the accurate approximation in variational distance of the probability masses of the typical sequences of a stationary ergodic source. In the proof of the direct part in Section III-C we avoid the explicit construction of a simulator by invoking the result of [1] in order to approximate with arbitrary accuracy the distribution of the codewords of a minimum-distortion encoder constructed by random coding, as in the conventional proof of the direct rate-distortion theorem. The general converse follows the same simple idea outlined above for the Bernoulli-1/2 source. Similar arguments are used with the Prohorov metric. For stationary ergodic processes the Prohorov finite-precision resolvability is shown to equal that obtained for the Kantorovich-Vasershtein distance defined with the same additive metric.

The variational distance is a special type of Kantorovich-Vasershtein distance, obtained when the sample-path metric is equal to 1 if the sample paths are different, and 0 if they are equal. In Section IV we show that one half of the minimum variational distance achievable with resolution rate R equals the limit of one minus the cumulative distribution function of the normalized entropy density, evaluated at R. In the stationary ergodic case, this means that the finite-precision resolvability is equal to the entropy rate for any allowed variational distance less than 2 (cf. Fig. 1), a phenomenon that does not occur with the other (less stringent) approximation measures studied in this paper. In particular, this is illustrated by the fact that no random process with entropy rate H < H(X) comes within variational distance D < 2 of the process X: pick H < G < H(X) and a source code with cardinality exp(Gn) and vanishing error probability for any candidate approximating random process with entropy rate H; the probability of the complement of that code approaches 0 under the approximating process and 1 under the desired process X.

C. Approximation with Normalized Divergence

In addition to variational distance, [1] investigated the divergence between the approximating and desired n-dimensional distributions divided by n. In the case of the Bernoulli-1/2 process, the entropy of the generated process is equal to 1 bit minus the normalized divergence; thus the finite-precision resolvability at distance D is equal to 1 - D (Fig. 1).

The result for other processes is more interesting. One way to derandomize a distribution is to raise its masses to a power $\alpha > 1$ (and normalize).² By choosing the value of $\alpha$, the entropy of the resulting distribution can be made equal to any desired fraction of the original entropy. We show in Section V that this strategy is optimal for approximating a large class of processes under divergence constraints. Finite-precision resolvability with respect to divergence is related to the exponent with which the probability of error of codes with rates below entropy goes to 1 [3]; however, it seems to be unrelated to classical rate-distortion theory.
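A minimal sketch of this tilting operation (my own illustration; the base distribution and the values of $\alpha$ are arbitrary):

```python
# Sketch (my own illustration) of the tilting p -> p^alpha / Z used to
# derandomize a distribution: larger alpha concentrates mass, lowering entropy.
import numpy as np

def tilt(p, alpha):
    q = p ** alpha
    return q / q.sum()

def entropy_bits(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

p = np.array([0.5, 0.25, 0.125, 0.125])
for alpha in (1.0, 1.5, 2.0, 4.0):
    print(f"alpha={alpha}: H={entropy_bits(tilt(p, alpha)):.3f} bits")
```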

D. Nonergodic/Nonstationary Rate-Distortion Theory

Following the tradition started in [1], we prove our results for general sources (not necessarily ergodic or stationary). Furthermore, the sample-path fidelity criteria that we allow in the definition of the Prohorov and Kantorovich-Vasershtein measures are very general and include nonsubadditive, context-dependent metrics. The resulting expressions for the finite-precision mean-resolvability are rate-distortion functions defined in terms of the sup-information rate introduced in [1]. In Section VI, we show the data-compression operational characterizations of those rate-distortion functions, namely, the maximal source coding rates with bounded average and maximal distortion. The proofs of those operational characterizations do not require the conventional assumptions of stationarity and ergodicity.

The problem of source coding with respect to a fidelity criterion for stationary (nonergodic) sources has been studied by several authors in the past [4], [5], [7]. In all those works a distortion-rate approach is taken, where (in contrast to the rate-distortion approach taken here) one fixes the rate of the code and minimizes the distortion. It is shown in [4] (see also [5]) that for stationary sources and additive distortion measures, the distortion-rate function D(R) equals the average of the distortion-rate functions $D_\theta(R)$ of the members $X_\theta$ in the ergodic decomposition of the source. (The average is taken with respect to the ergodic decomposition.) The distortion-rate function $D_\theta(R)$ of each of the members $X_\theta$ is given by the usual formula, i.e., minimization of the average distortion under a constraint on the mutual information rate.

²A government that would raise the value of every individual's net worth to the power $\alpha$ is a left/right-wing government depending on whether $\alpha$ is to the left/right of 1.

The distortion-rate function of stationary sources is investigated in detail in [7], for the special case of the additive Hamming fidelity criterion. In particular, it is shown that the distortion-rate function admits a representation as the infimum of the average (with respect to the ergodic decomposition) of $D_\theta(R_\theta)$, where the infimum is taken over all mappings $\theta \mapsto R_\theta$ such that the average of $R_\theta$ is equal to R. An interesting connection to Ornstein's $\bar d$ distance is also demonstrated in [7]: it is shown that for a stationary source X, D(R) equals the infimum of the $\bar d$ distance between X and Y over all Y whose entropy rate is less than R. In those works, the ergodic decomposition theorem plays a central role; although it is shown that the ergodicity assumption is not needed, ergodic properties are crucial in the proofs and are used through the ergodic decomposition theorem and the classical distortion-rate results for stationary and ergodic sources.

In this paper we are able to further generalize rate-distortion coding theorems to nonstationary sources owing to the use of the approach introduced in [1]. However, this should be viewed more as a bonus than as the main contribution of this paper, which is to establish a new operational meaning for the rate-distortion function: the complexity of the random number generation necessary to approximate the distributions of a source with given accuracy.

II. PRELIMINARIES

Let A, B be finite sets. We denote by $M_1(A)$ the set of all probability distributions on A. A source X with alphabet A is a sequence $P_X = \{P_{X^n}(\cdot)\}_{n \ge 1}$ of finite-dimensional distributions $P_{X^n} \in M_1(A^n)$. Throughout, a source will be denoted by X and $P_X$ interchangeably. Similarly, a channel $W_{Y|X}$ with input alphabet A and output alphabet B is a sequence of conditional distributions $\{W_{Y^n|X^n}(\cdot|\cdot)\}_{n \ge 1}$ such that $W_{Y^n|X^n}(\cdot|a^n) \in M_1(B^n)$ for every $a^n \in A^n$. Given a source $P_X$ and a channel $W_{Y|X}$, we use the notation $W_{Y|X} P_X$ to denote the joint source whose finite-dimensional distributions are $W_{Y^n|X^n} P_{X^n}$.

For any two distributions $P_{X^n} \in M_1(A^n)$, $P_{Y^n} \in M_1(B^n)$, we denote by $\mathcal{P}(P_{X^n}, P_{Y^n})$ the collection of all distributions $Q_{X^nY^n} \in M_1(A^n \times B^n)$ having $P_{X^n}$ and $P_{Y^n}$ as marginals, and we use the same notation for sources; thus $\mathcal{P}(P_X, P_Y)$ stands for the collection of all sources $Q_{XY}$ such that $Q_{X^nY^n} \in \mathcal{P}(P_{X^n}, P_{Y^n})$ for all $n \ge 1$.

A distortion measure on $A^n \times B^n$ is any nonnegative mapping $\rho_n : A^n \times B^n \to \mathbb{R}_+$ with the property that for every $a^n \in A^n$ there exists at least one element $b^n \in B^n$ such that $\rho_n(a^n, b^n) = 0$. Similarly, a distortion measure on $M_1(A^n) \times M_1(B^n)$ is a nonnegative mapping $r_n : M_1(A^n) \times M_1(B^n) \to \mathbb{R}_+$ such that every distribution in $M_1(A^n)$ has at least one zero-distortion distribution in $M_1(B^n)$.

All logarithms in this paper have an arbitrary base greater than 1 and $\exp(\cdot)$ refers to that base; $\exp_e(\cdot)$ refers to the natural base.

Definition 1 [1]: The resolution R(P) of a distribution P on A is the minimum $\log M$ such that P is an M-type (i.e., the masses it assigns to elements of A are multiples of 1/M). If no such M exists, $R(P) = \infty$.
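For distributions with rational masses, the resolution can be computed directly: it is the logarithm of the least common multiple of the mass denominators. A minimal sketch (my own, assuming exact rational inputs):

```python
# Sketch (illustrative, not from the paper): the resolution of a finite
# distribution with rational masses is log M for the smallest M making
# every mass a multiple of 1/M, i.e., the LCM of the denominators.
from fractions import Fraction
from math import lcm, log2

def resolution_bits(masses):
    fracs = [Fraction(m).limit_denominator(10**6) for m in masses]
    M = lcm(*(f.denominator for f in fracs))
    return log2(M)

print(resolution_bits([Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]))  # 2.0
print(resolution_bits([Fraction(3, 8), Fraction(5, 8)]))                  # 3.0
```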

Definition 2: Let $P_X$ be a process with alphabet A and let $r = \{r_n(\cdot,\cdot)\}_{n \ge 1}$ be a sequence of distortion measures on $M_1(A^n) \times M_1(B^n)$. R is a D-achievable resolution rate of X if for every $\gamma > 0$ there exists

$P_Y = \{P_{Y^n} : P_{Y^n} \in M_1(B^n)\}_{n \ge 1}$

such that for all sufficiently large n

$\frac{1}{n} R(P_{Y^n}) \le R + \gamma \qquad (1)$

and

$\limsup_{n \to \infty} r_n(P_{X^n}, P_{Y^n}) \le D. \qquad (2)$

Definition 3: The finite-precision resolvability of X is defined as the infimum of the D-achievable resolution rates of X and is denoted by $S(D, X)$.

Definitions 2 and 3 were given in [1] for the special case where $r_n$ is the variational distance on $A^n$, in which case $S(D, X)$ was denoted in [1] as $S_0(X)$ and will be denoted in the sequel as $S_v(D, X)$. Note that we do not put any structure on the sequence of distortion measures $r = \{r_n(\cdot,\cdot)\}$. In many cases of interest B = A and r is a sequence of metrics on $M_1(A^n)$. We now define the metrics/distortion measures used throughout this work.

Definition 4 (e.g., [3]): The variational distance or $\ell_1$ distance between distributions P and Q on A is

$d_v(P, Q) = \sum_{a \in A} |P(a) - Q(a)|.$

Definition 5 [8], [9]: Let $\rho_n$ be a metric on $A^n$. The corresponding Prohorov distance $d_p(P_{X^n}, P_{Y^n})$ between two distributions $P_{X^n}$, $P_{Y^n}$ on $A^n$ is defined as

$d_p(P_{X^n}, P_{Y^n}) = \inf_{Q \in \mathcal{P}(P_{X^n}, P_{Y^n})} \inf\left\{\epsilon > 0 : Q\left\{a^n b^n : \tfrac{1}{n}\rho_n(a^n, b^n) > \epsilon\right\} \le \epsilon\right\}.$

The infimum over $\mathcal{P}(P_{X^n}, P_{Y^n})$ is actually a minimum [9].

Definition 6 [17]: Let $P_{X^n}$, $P_{Y^n}$ be two distributions on $A^n$. The normalized divergence $\frac{1}{n} D(P_{X^n} \| P_{Y^n})$ of $P_{X^n}$ relative to $P_{Y^n}$ is defined as

$\frac{1}{n} D(P_{X^n} \| P_{Y^n}) = \frac{1}{n} \sum_{a^n \in A^n} P_{X^n}(a^n) \log \frac{P_{X^n}(a^n)}{P_{Y^n}(a^n)}.$

In the following definitions, a sequence of distortion measures $\{\rho_n(\cdot,\cdot)\}_{n \ge 1}$ is used to construct corresponding distortion measures between sources. Definition 7 is the $\bar\rho_n$ distance (e.g., [10]) with the only exceptions that the $\rho_n(\cdot,\cdot)$ need not be metrics and $P_X$, $P_Y$ are arbitrary sequences of distributions. In Definition 8, the expectation operation used in the definition of $\bar\rho$ is replaced by the limsup in probability.

Definition 7 (e.g., [10]): Let $\{\rho_n(\cdot,\cdot)\}_{n \ge 1}$ be given. The $\bar\rho_n$ distortion measure between two distributions $P_{X^n}$, $P_{Y^n}$ is defined as

$\bar\rho_n(P_{X^n}, P_{Y^n}) = \inf_{Q \in \mathcal{P}(P_{X^n}, P_{Y^n})} E_Q\left[\frac{1}{n}\rho_n(X^n, Y^n)\right] \qquad (3)$

where $E_Q$ stands for expectation according to Q. The $\bar\rho$ distortion between sources $P_X$, $P_Y$ is defined as

$\bar\rho(P_X, P_Y) = \limsup_{n \to \infty} \bar\rho_n(P_{X^n}, P_{Y^n}).$

The infimum in (3) is always achieved [12, Theorem 10.4.1].

Definition 8: The $\bar\rho_S$ distortion measure between two sources $P_X$, $P_Y$ is defined as the infimum over $\mathcal{P}(P_X, P_Y)$ of the limsup in probability of $\rho_n(X^n, Y^n)/n$, i.e.

$\bar\rho_S(P_X, P_Y) = \inf_{Q_{XY} \in \mathcal{P}(P_X, P_Y)} \bar\rho(Q_{XY})$

where

$\bar\rho(Q_{XY}) = \inf\left\{h : \lim_{n \to \infty} Q_{X^nY^n}\left\{a^n b^n : \frac{1}{n}\rho_n(a^n, b^n) > h\right\} = 0\right\}. \qquad (4)$

Note that since $Q_{X^nY^n}$ need not be a marginal of $Q_{X^{n+1}Y^{n+1}}$, the infimum over $\mathcal{P}(P_X, P_Y)$ is always achieved. This is proved in Appendix I.

We intentionally did not put any restriction on the sequence $\{\rho_n(\cdot,\cdot)\}$, so that a full analogy with rate-distortion theory is kept. However, of special interest is the case where the alphabets are identical (A = B) and $\rho_n(\cdot,\cdot)$ is a pseudometric on $A^n$, for every $n \ge 1$. It is shown in Appendix I that if this is the case, then $\bar\rho_S$ is a pseudometric on the space of all sequences of finite-dimensional distributions.

A few words on the choice of these accuracy measures are now in order. The variational distance has proven to be a fruitful measure of approximation error in [1], which shows applications of resolvability with respect to variational distance in noiseless source coding and in identification via channels. The $\bar d$ distance introduced by Ornstein [13] and its generalization $\bar\rho$ consider processes to be close if their sample paths "look alike" according to a measure of distance on the sample space. This apparently natural notion of closeness has found applications to problems in information theory involving approximations [10] and in robust statistics [14]. Consider, for example, the following problem: we are given a codebook $C_N$ and a sequence of distortion measures $\{\rho_n\}$, and we wish to determine the average distortion resulting from using $C_N$ to encode a stationary and ergodic source $P_X$. An exact evaluation of the average distortion is sometimes infeasible, and a common practice is to simulate the random source, use the codebook $C_N$ to encode the simulator output, and then compute the empirical average distortion between the encoder output and the simulator output. What is the minimal complexity of the source simulation scheme such that the average distortion predicted by this procedure resides within a given distance from the exact (unknown) average distortion? Denote by $\rho(C_N | P_X)$ the average distortion resulting from using $C_N$ to encode the symbols emitted by $P_X$. It is shown in [10] that for any pair of stationary sources $P_X$, $P_Y$ and any codebook $C_N$,

$|\rho(C_N | P_X) - \rho(C_N | P_Y)| \le \bar\rho(P_X, P_Y).$

Thus if we want the empirical average distortion to reside within distance $\delta$ of the exact unknown average distortion, it is enough to make sure that the statistics of the source simulator $P_Y$ satisfy $\bar\rho(P_X, P_Y) \le \delta$. The minimal complexity for doing this is the source resolvability $S_{\bar\rho}(\delta, X)$ according to the $\bar\rho$ distortion measure. Further applications of the Prohorov distance and the $\bar\rho$ distance in robust statistics can be found in [14], [15]. Examples of the evaluation of $\bar\rho$ can be found in [10].
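The following fragment sketches this simulation-based estimation of codebook distortion (my own illustration; the codebook, the deliberately poor simulator, and all parameters are hypothetical choices for the demo):

```python
# Illustrative only: estimate rho(C_N | P) by encoding blocks drawn from a
# source (or from a simulator of it) with a nearest-codeword encoder under
# per-letter Hamming distortion.
import numpy as np

rng = np.random.default_rng(0)
n, num_codewords, trials = 16, 32, 2000
codebook = rng.integers(0, 2, size=(num_codewords, n))

def avg_distortion(sample_block):
    # empirical average of min_{c in C} (1/n) Hamming(x, c)
    total = 0.0
    for _ in range(trials):
        x = sample_block()
        total += (codebook != x).sum(axis=1).min() / n
    return total / trials

true_source = lambda: rng.integers(0, 2, size=n)               # Bernoulli-1/2
biased_sim = lambda: rng.integers(0, 2, size=n) & rng.integers(0, 2, size=n)
# biased_sim emits Bernoulli-1/4 bits: a deliberately poor simulator, so the
# two estimates differ by roughly the rho-bar distance between the sources.
print(avg_distortion(true_source), avg_distortion(biased_sim))
```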

By its definition, the $\bar\rho$ distortion is the minimal possible average per-letter distortion between sample paths of X and Y. Clearly, in the general nonergodic case low $\bar\rho$ distortion does not guarantee that there exists a joint source $Q_{XY}$ such that the (random) per-letter distortion $\limsup \rho_n(X^n, Y^n)/n$ is low; it can be larger than $\bar\rho$ with positive probability. This fact is the reason for the introduction of the $\bar\rho_S$ distortion. If $\bar\rho_S(P_X, P_Y) < D$, we are assured that there exists $Q_{XY} \in \mathcal{P}(P_X, P_Y)$ according to which the probability that $\rho_n(X^n, Y^n)/n \ge D$ vanishes with the blocklength n.

Intuitively, if $\{\rho_n(\cdot,\cdot)\}$ is a sequence of additive distortion measures (so that $\rho_n/n$ is a sample mean) and X, Y are stationary and ergodic, $\bar\rho_S$ should agree with $\bar\rho$. Lemma 12 of Appendix I states that this is indeed true, under the assumption that $\rho_n/n$ is bounded.

Note that in contrast to $d_v$, $d_p$, and $\bar\rho$, the $\bar\rho_S$ distortion measure is not defined pointwise in n, and thus Definition 2, as is, does not apply to $\bar\rho_S$. The next definition is a slight modification of Definition 2 that holds separately for the $\bar\rho_S$ distortion measure.

Definition 9: Let $P_X$ be a process with alphabet A and let $\{\rho_n(\cdot,\cdot)\}_{n \ge 1}$ be a sequence of distortion measures on $A^n \times B^n$. R is a D-achievable $\bar\rho_S$ resolution rate of X if for every $\gamma > 0$ there exists $P_Y$ such that for all sufficiently large n

$\frac{1}{n} R(P_{Y^n}) \le R + \gamma$

and

$\bar\rho_S(P_X, P_Y) \le D. \qquad (5)$

Once a D-achievable resolution rate according to the $\bar\rho_S$ measure is defined, the finite-precision resolvability of X according to $\bar\rho_S$ is defined, as with respect to the other measures, as the infimum over all D-achievable resolution rates of X. In a slight abuse of notation, we will denote by $S_v(D, X)$, $S_p(D, X)$, $S_{\bar\rho}(D, X)$, $S_d(D, X)$, and $S(D, X)$ the finite-precision resolvability of X according to the $d_v$, Prohorov, $\bar\rho$, normalized divergence, and $\bar\rho_S$ distortion measures, respectively. The following relations between the various resolvability functions hold.

Lemma 1:
a) If $\rho_n/n$ is bounded uniformly in n, then $S_{\bar\rho}(D, X) \le S(D, X)$.
b) Let $\rho_n$ be a metric on $A^n$ for every $n \ge 1$ and let $d_p$ be the corresponding Prohorov distance as defined in Definition 5. Then
 1) $S_p(D, X) \le S(D, X)$;
 2) $S_p(D, X) \le S_{\bar\rho}(D^2, X)$;
 3) if $\rho_n/n \le \rho_{\max}$ for some $\rho_{\max} < \infty$ and all sufficiently large n, then $S_{\bar\rho}(D(1 + \rho_{\max}), X) \le S_p(D, X)$.

Proof: Parts a) and b1) are immediate to verify. Part b2) follows from a corresponding relation between the metrics,

$d_p^2(P_{X^n}, P_{Y^n}) \le \bar\rho_n(P_{X^n}, P_{Y^n})$

stated and proved in [14]. Similarly, b3) follows from the relation [14]

$\bar\rho_n(P_{X^n}, P_{Y^n}) \le d_p(P_{X^n}, P_{Y^n})(1 + \rho_{\max}). \qquad \square$

We proceed now to the definition of the relevant information-theoretic functions.

Definition 10 [1]: Given a joint distribution $P_{X^nY^n}$ on $A^n \times B^n$ with marginals $P_{X^n}$, $P_{Y^n}$, the information density is the function

$i_{X^nY^n}(a^n; b^n) = \log \frac{P_{X^nY^n}(a^n b^n)}{P_{X^n}(a^n) P_{Y^n}(b^n)} = \log \frac{W_{Y^n|X^n}(b^n | a^n)}{P_{Y^n}(b^n)}.$

The distribution of the random variable $i_{X^nY^n}(X^n; Y^n)/n$ is referred to as the information spectrum of $P_{X^nY^n}$, and the expected value of the information spectrum is the normalized mutual information $I(X^n; Y^n)/n$. The mutual information rate of XY is defined as

$I(X; Y) = \lim_{n \to \infty} \frac{1}{n} I(X^n; Y^n)$

provided the limit exists. Unlike the mutual information rate, the following concept is always defined.

Definition 11 [1]: The sup-information rate $\bar I(X; Y)$ of the joint process XY is defined as the limsup in probability of the sequence of random variables $i_{X^nY^n}(X^n; Y^n)/n$; i.e.

$\bar I(X; Y) = \inf\left\{h : \lim_{n \to \infty} P_{X^nY^n}\left\{a^n b^n : \frac{1}{n} i_{X^nY^n}(a^n; b^n) > h\right\} = 0\right\}.$

Analogously, the inf-information rate $\underline I(X; Y)$ of the joint process XY is the liminf in probability of $i_{X^nY^n}(X^n; Y^n)/n$.


If $\bar I(X; Y) = \underline I(X; Y)$ then ([1])

$\lim_{n \to \infty} \frac{1}{n} I(X^n; Y^n) = \bar I(X; Y)$

and the joint process XY is called information-stable. In case X is equal to Y, $\bar I(X; X)$ (respectively, $\underline I(X; X)$) is referred to as the sup-entropy rate (respectively, inf-entropy rate) of X and is denoted by $\bar H(X)$ (respectively, $\underline H(X)$).

Definition 12 [2]: Given a joint distribution

$P_{X^nY^n}(a^n, b^n) = P_{X^n}(a^n) W_{Y^n|X^n}(b^n | a^n)$

the conditional entropy density is the function

$i_{Y^n|X^n}(b^n | a^n) = -\log W_{Y^n|X^n}(b^n | a^n).$

The distribution of the random variable $\frac{1}{n} i_{Y^n|X^n}(Y^n | X^n)$, where $X^n$, $Y^n$ have joint distribution $P_{X^nY^n}$, is referred to as the conditional entropy spectrum, and the expected value of the conditional entropy spectrum is the normalized conditional entropy $\frac{1}{n} H(Y^n | X^n)$.

Definition 13 [2]: Let

$P_{XY} = \{P_{X^n} W_{Y^n|X^n}\}_{n \ge 1}$

be given. The conditional sup-entropy rate $\bar H(Y|X)$ of Y given X is defined as the limsup in probability of the sequence of random variables $\frac{1}{n} i_{Y^n|X^n}(Y^n | X^n)$; i.e.

$\bar H(Y|X) = \inf\left\{h : \lim_{n \to \infty} P_{X^nY^n}\left\{a^n b^n : \frac{1}{n} i_{Y^n|X^n}(b^n | a^n) > h\right\} = 0\right\}.$

Analogously, the conditional inf-entropy rate $\underline H(Y|X)$ is the liminf in probability of the same sequence; i.e.

$\underline H(Y|X) = \sup\left\{h : \lim_{n \to \infty} P_{X^nY^n}\left\{a^n b^n : \frac{1}{n} i_{Y^n|X^n}(b^n | a^n) < h\right\} = 0\right\}.$

In case the input process X is deterministic, or in case $Y^n$ is independent of $X^n$ for every n, $\bar H(Y|X)$ and $\underline H(Y|X)$ coincide with the sup-entropy rate $\bar H(Y)$ and inf-entropy rate $\underline H(Y)$ of Y as defined in [1], respectively. In the sections to follow we shall make use of a few properties of entropy rates, which generalize the corresponding familiar properties of entropy. These are stated in the next lemma.

Lemma 2:
a) $\bar H(Y|X) \ge \underline H(Y|X) \ge 0$.
b) $\bar H(Y) \le \log |B|$, where B is the alphabet of Y.
c) $\bar H(Y) - \bar H(Y|X) \le \bar I(X; Y) \le \bar H(Y) - \underline H(Y|X)$.
d) Conditioning reduces sup-entropy; i.e., $\bar H(Y|X) \le \bar H(Y)$.

Proof: Parts a) and c) are immediate to verify. Part b) follows from the fact that $\bar H(Y)$ is the minimal achievable block coding rate of Y [1]. Part d) follows from the fact that $\bar H(Y|X)$ is the minimal achievable block coding rate of Y with side information X [2], and hence cannot be larger than $\bar H(Y)$. $\square$

In the next sections we show that the finite-precision resolvability of X is equal to the infimum of $\bar I(X; Y)$ over an appropriate class of channels. This can be viewed as a "rate-distortion counterpart" of the channel resolvability results in [1], where it is shown that for a given channel $W_{Y|X}$, the supremum of $\bar I(X; Y)$ over all input processes X is the minimal number of random bits per channel use needed to generate an input such that the output statistics are arbitrarily close to the desired ones.

III. APPROXIMATION WITH ORNSTEIN, PROHOROV, AND RELATED DISTANCES

A. Approximation in the $\bar\rho_S$ Sense and the Sup Rate-Distortion Function

In this section we state and prove the finite-precision resolvability result with respect to the $\bar\rho_S$ distortion measure. We start with the definition of the appropriate rate-distortion function.

Definition 14: Let $P_X$ be a process with alphabet A and let $\{\rho_n(\cdot,\cdot)\}_{n \ge 1}$ be a sequence of distortion measures on $A^n \times B^n$. The sup rate-distortion function $\bar R(D)$ is defined as

$\bar R(D) = \inf_{Q_{XY} \in \mathcal{T}(D)} \bar I(X; Y)$

where $\mathcal{T}(D)$ is the class of all joint processes $Q_{XY}$ with X marginal $P_X$, such that

$\bar\rho(Q_{XY}) \le D$

($\bar\rho(Q_{XY})$ is defined in (4)).

Theorem 1:

$S(D, X) = \bar R(D).$

In the proof of the achievability part, we will make use of the following lemma.

Lemma 3: Let $P_X$, $P_Y$ be sources such that

$\bar H(Y) \le R$

and

$\bar\rho_S(P_X, P_Y) \le D. \qquad (6)$

Then R is a D-achievable resolution rate of X.

Proof: By [1] we know that for every $\delta > 0$ there exists $\tilde Y$ such that for all sufficiently large n

$\frac{1}{n} R(P_{\tilde Y^n}) \le R + \delta$

and

$\lim_{n \to \infty} d_v(P_{Y^n}, P_{\tilde Y^n}) = 0.$

If $d_v(P_{Y^n}, P_{\tilde Y^n}) < \delta$, then there exists a joint distribution $P_{Y^n \tilde Y^n}$ such that

$P_{Y^n \tilde Y^n}(Y^n \ne \tilde Y^n) < \frac{\delta}{2} \qquad (7)$

(see [16]). Now, let $P_{XY} = \{P_{X^nY^n}\}$ be the joint source that satisfies (6). Let $P_{X^nY^n\tilde Y^n}$ be any joint distribution having $P_{X^nY^n}$, $P_{Y^n\tilde Y^n}$ as marginals, and define

$P_{X^n\tilde Y^n}(a^n, \tilde b^n) = \sum_{b^n \in B^n} P_{X^nY^n\tilde Y^n}(a^n, b^n, \tilde b^n).$

We claim that the variational distance between $P_{X^n\tilde Y^n}$ and $P_{X^nY^n}$ is less than or equal to $\delta$. To see this, let A be any subset of $A^n \times B^n$. $P_{X^n\tilde Y^n}(A)$ can be bounded as follows:

$P_{X^n\tilde Y^n}(A) = P_{X^nY^n\tilde Y^n}(a^n b^n \tilde b^n : a^n \tilde b^n \in A)$
$= P_{X^nY^n\tilde Y^n}(a^n b^n \tilde b^n : a^n b^n \in A, \ a^n \tilde b^n \in A) + P_{X^nY^n\tilde Y^n}(a^n b^n \tilde b^n : a^n b^n \in A^c, \ a^n \tilde b^n \in A)$
$\le P_{X^nY^n}(A) + P_{Y^n\tilde Y^n}(Y^n \ne \tilde Y^n) \qquad (8)$

and the roles of $Y$, $\tilde Y$ in (8) can be interchanged. Thus (7) and (8) imply that

$d_v(P_{X^nY^n}, P_{X^n\tilde Y^n}) = 2 \max_{A \subseteq A^n \times B^n} (P_{X^nY^n}(A) - P_{X^n\tilde Y^n}(A)) \le \delta$

so that

$\lim_{n \to \infty} d_v(P_{X^nY^n}, P_{X^n\tilde Y^n}) = 0$

and since $P_{XY}$ achieves (6), this implies that

$\bar\rho_S(P_X, P_{\tilde Y}) \le D. \qquad \square$

Proof of Theorem 1:

Direct part (achievability): We use a random selection argument to prove the achievability of every pair $(R, D + \delta)$ for $\delta > 0$, $R > \bar R(D)$. Fix $D > 0$, and let $Q_{XY}$ be a joint source that achieves equality in the definition of $\bar R(D)$. (Note that since a source is an arbitrary sequence of distributions, such a joint source always exists.) Let $Q_Y$ be its Y marginal ($Q_X = P_X$). Define the sets

$A_{n,\delta} = \left\{a^n b^n : \frac{1}{n}\rho_n(a^n, b^n) < \bar\rho(Q_{XY}) + \delta, \ \frac{1}{n} i_{X^nY^n}(a^n; b^n) < \bar I(X; Y) + \delta\right\}$

where $\bar I(X; Y)$ is according to $Q_{XY}$ and hence equals $\bar R(D)$. We have:

P1) $\lim_{n \to \infty} Q_{X^nY^n}(A_{n,\delta}) = 1$.

P2) If $a^n b^n \in A_{n,\delta}$, then

$Q_{Y^n|X^n}(b^n | a^n) = \exp\{i_{X^nY^n}(a^n; b^n)\} Q_{Y^n}(b^n) \le Q_{Y^n}(b^n) \exp\{n(\bar I(X; Y) + \delta)\}.$

Choose $M = \exp(nR)$ n-blocks independently, according to $Q_{Y^n}$. Denote this random set by C. For a given C we denote by $A(C)$ the set of sequences $a^n \in A^n$ such that there exists $b^n \in C$ with

$\frac{1}{n}\rho_n(a^n, b^n) \le D + \delta.$

For every $a^n \in A^n$, let $V_{Y^n|X^n}(\cdot|a^n)$ be a distribution that gives mass 1 to an element $b^n \in C$ that minimizes $\rho_n(a^n, b^n)$, and let $V_{Y^n}$ be the $Y^n$ marginal of $V_{X^nY^n} = V_{Y^n|X^n} P_{X^n}$. Since for every realization of C the corresponding $V_{Y^n}$ is a distribution over a (super) alphabet of size $\exp(nR)$, the resolvability of the sequence $\{V_{Y^n}\}$ is at most R (i.e., $\bar H(V_Y) \le R$). In view of Lemma 3 it suffices to show that the distance between $P_X$ and $V_Y$ is close to D. Now, for a given C

$\bar\rho_S(P_X, V_Y) \le D + \delta + f\left(\limsup_{n \to \infty} P_{X^n}(A^c(C))\right)$

where $f(u) = 0$ for $u = 0$ and $f(u) = \infty$ otherwise. Thus we have to show that the average of $P_{X^n}(A^c(C))$ over all realizations of C goes to 0 as $n \to \infty$. This will imply that there exists at least one sequence of realizations of C with

$\limsup_{n \to \infty} P_{X^n}(A^c(C)) = 0.$

From this point, the proof follows the lines of classical rate-distortion arguments. Indeed

$E_{Q_{Y^n}} P_{X^n}(A^c(C)) = \sum_C Q_{Y^n}(C) \sum_{a^n \notin A(C)} P_{X^n}(a^n) = \sum_{a^n \in A^n} P_{X^n}(a^n) \sum_{C : a^n \notin A(C)} Q_{Y^n}(C) \qquad (9)$

where the last sum on the right-hand side of (9) is the probability of choosing a C that does not represent the specific $a^n$ within an error $D + \delta$. Define

$K(a^n, b^n) = \begin{cases} 1, & \text{if } (a^n, b^n) \in A_{n,\delta} \\ 0, & \text{otherwise.} \end{cases} \qquad (10)$

The probability that a single element chosen randomly according to $Q_{Y^n}$ does not represent a fixed $a^n$ within an error $D + \delta$ is

$Q_{Y^n}((a^n, Y^n) \notin A_{n,\delta}) = Q_{Y^n}(K(a^n, Y^n) = 0) = 1 - \sum_{b^n} Q_{Y^n}(b^n) K(a^n, b^n)$

and thus the probability of choosing independently $\exp(nR)$ words so that $a^n$ is not represented within error $D + \delta$ is

$\left(1 - \sum_{b^n} Q_{Y^n}(b^n) K(a^n, b^n)\right)^{\exp(nR)}$

which implies

$E_{Q_{Y^n}} P_{X^n}(A^c(C)) = \sum_{a^n \in A^n} P_{X^n}(a^n) \left(1 - \sum_{b^n} Q_{Y^n}(b^n) K(a^n, b^n)\right)^{\exp(nR)}$
$\le \sum_{a^n} P_{X^n}(a^n) \left(1 - \exp\{-n(\bar I(X; Y) + \delta)\} \sum_{b^n} Q_{Y^n|X^n}(b^n | a^n) K(a^n, b^n)\right)^{\exp(nR)} \qquad (11)$

where the inequality is due to P2). Using the inequality

$(1 - xy)^M \le 1 - x + e^{-yM}, \qquad 0 \le x, y \le 1$

in (11), we obtain

$E_{Q_{Y^n}} P_{X^n}(A^c(C)) \le 1 - \sum_{a^n b^n} P_{X^n}(a^n) Q_{Y^n|X^n}(b^n | a^n) K(a^n, b^n) + \exp_e\{-\exp[n(R - \bar R(D) - \delta)]\}$
$= 1 - Q_{X^nY^n}(A_{n,\delta}) + \exp_e\{-\exp[n(R - \bar R(D) - \delta)]\}. \qquad (12)$

According to P1), $Q_{X^nY^n}(A_{n,\delta})$ goes to 1 as $n \to \infty$. Now, $\delta$ is arbitrary, and hence if $R > \bar R(D)$ it can always be chosen so that the right-hand side of (12) vanishes as $n \to \infty$. This proves the direct part.

Converse part: We show that if $P_Y$ is a source with

$\bar\rho_S(P_X, P_Y) \le D$

then

$\bar H(Y) \ge \bar R(D).$

This will imply the converse, since the limit of the normalized resolution of any process is lower-bounded by its sup-entropy rate, by the definition of these quantities. Let $\Gamma_{XY}$ be the sequence that achieves the infimum in the definition of $\bar\rho_S(P_X, P_Y)$. Clearly, its marginals are $P_X$, $P_Y$, and it also satisfies $\bar\rho(\Gamma_{XY}) \le D$. Now

$\bar H(Y) \ge \bar H(Y) - \underline H(Y|X) \ge \bar I(X; Y) \ge \inf_{Q_{XY} : Q_X = P_X, \ \bar\rho(Q_{XY}) \le D} \bar I(X; Y) = \bar R(D)$

which is the desired result. $\square$

B. Approximation in the Prohorov Metric

In this section we prove the corresponding result for the Prohorov metric. We assume throughout that the approximating process Y has the same alphabet as the source X (A = B), and that $\rho_n(\cdot,\cdot)$ is a metric on $A^n$, for every n. The first step is to define the counterpart of the sup rate-distortion function in the Prohorov sense.

Definition 15: The Prohorov sup rate-distortion function $\bar R_p(D)$ is defined as

$\bar R_p(D) = \inf_{Q \in \mathcal{V}(D)} \bar I(X; Y)$

where $\mathcal{V}(D)$ is the class of all joint processes $Q_{XY}$ with X marginal $P_X$, such that for every $\epsilon > 0$

$Q_{X^nY^n}\left\{a^n b^n : \frac{1}{n}\rho_n(a^n, b^n) > D + \epsilon\right\} \le D + \epsilon$

for all sufficiently large n.

The next theorem characterizes $\bar R_p(D)$ as the Prohorov finite-precision resolvability of X.

Theorem 2:

$S_p(D, X) = \bar R_p(D).$

As in Theorem 1, the proof of Theorem 2 makes use of the following lemma.

Lemma 4: Let $P_X$, $P_Y$ be sources such that

$\bar H(Y) \le R$

and

$\limsup_{n \to \infty} d_p(P_{X^n}, P_{Y^n}) \le D.$

Then for every $\delta > 0$, R is a $(D + \delta)$-achievable Prohorov resolution rate of X.

Proof: The proof follows the lines of that of Lemma 3 and is omitted. $\square$

Proof of Theorem 2: The proof follows the lines of the proof of Theorem 1 with a few variations.

Achievability: We show achievability of every pair $(R, D + 4\delta)$ for every $R > \bar R_p(D)$, $\delta > 0$. Let $Q_{XY}$ be a sequence that achieves equality in the definition of $\bar R_p(D)$, and let $Q_Y$ be its Y marginal ($Q_X = P_X$). Define the sets

$A'_{n,\epsilon} = \left\{a^n b^n : \frac{1}{n}\rho_n(a^n, b^n) < D + \epsilon, \ \frac{1}{n} i_{X^nY^n}(a^n; b^n) < \bar I(X; Y) + \epsilon\right\} \qquad (13)$

where $\bar I(X; Y)$ is according to $Q_{XY}$ and hence equals $\bar R_p(D)$. We have:

P1') $\liminf_{n \to \infty} Q_{X^nY^n}(A'_{n,\epsilon}) \ge 1 - D - \epsilon$.

P2') For every $a^n b^n \in A'_{n,\epsilon}$

$Q_{Y^n}(b^n) \ge Q_{Y^n|X^n}(b^n | a^n) \exp\{-n(\bar I(X; Y) + \epsilon)\}.$

Choose $M = \exp(nR)$ n-blocks independently according to $Q_{Y^n}$ and denote this (random) set by C. Construct the set $A(C)$ and the distribution $V_{X^nY^n}$ exactly as described in the proof of Theorem 1.

Now we shall make use of the following property of the Prohorov distance. Let $Q_{X^nY^n}$ be a distribution on $A^n \times A^n$ with marginals $Q_{X^n}$, $Q_{Y^n}$. From Definition 5 it follows that for every subset $A \subseteq A^n \times A^n$

$d_p(Q_{X^n}, Q_{Y^n}) \le \max\left\{\sup_{a^n b^n \in A} \frac{1}{n}\rho_n(a^n, b^n), \ Q_{X^nY^n}(A^c)\right\}. \qquad (14)$

In view of (14), the Prohorov distance between $V_{Y^n}$ and the source distribution $P_{X^n}$ satisfies

$d_p(P_{X^n}, V_{Y^n}) \le \max\{D + \delta, \ P_{X^n}(A^c(C))\}.$

Thus in view of Lemma 4 it remains to show that the average of $P_{X^n}(A^c(C))$ over all realizations of C is arbitrarily close to $D + 2\delta$ (as $n \to \infty$).

Define now the function $K(a^n, b^n)$ as in (10), but with $A'_{n,\delta}$ of (13) replacing $A_{n,\delta}$ there. Repeating the arguments that lead to (12), we conclude that

$E_{Q_{Y^n}} P_{X^n}(A^c(C)) \le 1 - Q_{X^nY^n}(A'_{n,\delta}) + \exp_e\{-\exp[n(R - \bar R_p(D) - \delta)]\}.$

According to P1'), $Q_{X^nY^n}(A'_{n,\delta})$ is at least $1 - D - \delta$ for n large enough. Thus

$E_{Q_{Y^n}} P_{X^n}(A^c(C)) \le D + 2\delta + \exp_e\{-\exp[n(R - \bar R_p(D) - \delta)]\}. \qquad (15)$

If $R > \bar R_p(D)$, we can always choose $\delta > 0$ such that the exponential term in (15) vanishes. This proves the achievability part.

Converse: We show that if $P_Y$ is a source with

$\limsup_{n \to \infty} d_p(P_{X^n}, P_{Y^n}) \le D \qquad (16)$

then

$\bar H(Y) \ge \bar R_p(D).$

By (16), there exists a joint source $P_{XY}$ with marginals $P_X$, $P_Y$, such that for every $\epsilon > 0$

$P_{X^nY^n}\left\{a^n b^n : \frac{1}{n}\rho_n(a^n, b^n) \ge D + \epsilon\right\} \le D + \epsilon \qquad (17)$

for all sufficiently large n. Now

$\bar H(Y) \ge \bar H(Y) - \underline H(Y|X) \ge \bar I(X; Y) \ge \bar R_p(D)$

where the last inequality holds since, by (17) and the definition of the set $\mathcal{V}(D)$, $P_{XY} \in \mathcal{V}(D)$. $\square$

Remark: Note that $\mathcal{T}(D) \subseteq \mathcal{V}(D)$, and hence $\bar R_p(D) \le \bar R(D)$, in accordance with Lemma 1.

C. Approximation in the $\bar\rho$ Sense

In this section we prove a result analogous to Theorem 1, but in the $\bar\rho$ sense rather than $\bar\rho_S$. Although the notion of sup-distance seems more natural than average distance when dealing with general sources, the full analogy with classical rate-distortion theory is given by the $\bar\rho$-resolvability.

Definition 16: The $\bar\rho$ sup rate-distortion function associated with X is defined as

$\bar R_{\bar\rho}(D) = \inf_{Q \in \mathcal{U}(D)} \bar I(X; Y)$

where $\mathcal{U}(D)$ is the class of all joint sources $Q_{XY}$ having $P_X$ as marginal, such that

$\limsup_{n \to \infty} E_Q\left[\frac{1}{n}\rho_n(X^n, Y^n)\right] \le D.$

The main result of this section is the characterization of $\bar R_{\bar\rho}(D)$ as the $\bar\rho$ finite-precision resolvability of X, under the assumption that $\rho_n/n$ is bounded uniformly in n.

Theorem 3: Let $\{\rho_n(\cdot,\cdot)\}_{n \ge 1}$ be a sequence of distortion measures on $A^n \times B^n$ satisfying

$\max_{a^n, b^n} \frac{1}{n}\rho_n(a^n, b^n) \le \rho_{\max} < \infty$

for all sufficiently large n. Then

$S_{\bar\rho}(D, X) = \bar R_{\bar\rho}(D).$

As in Theorems 1 and 2, we make use of the following lemma.

Lemma 5: Let $P_X$, $P_Y$ be sources such that

$\bar H(Y) \le R$

and

$\limsup_{n \to \infty} \bar\rho_n(P_{X^n}, P_{Y^n}) \le D.$

Then for any $\delta > 0$, R is a $(D + \delta)$-achievable $\bar\rho$ resolution rate of X.

Proof: The proof follows the lines of the proof of Lemma 3, making use of the fact that $\rho_n/n$ is bounded. $\square$

The definitions of the $\bar\rho_S$ and the Prohorov distances guarantee that the limits as $n \to \infty$ of the probability of large approximation errors are properly bounded (by 0 in the $\bar\rho_S$ distance and by D in the Prohorov distance). This fact compensates for the possible nonergodicity of the sources and allows one to define "typical distortion sets" ($A_{n,\epsilon}$ and $A'_{n,\epsilon}$) and to bound their probabilities. Concentrating on these sets, the achievability part is established by showing that the average (over all realizations of C) of the probability of a large approximation error is properly bounded, by 0 or by D. This strategy is inefficient here, since $\bar\rho < D$ does not imply much about the probabilities of higher distortions. Thus the definition of the typical set in the proof of Theorem 3 does not involve bounds on distortion, and the direct part is established by directly bounding the average distortion over all realizations of C.

Proof of Theorem 3: Direct part: We show achievability of every pair $(R, D + 2\delta)$ for $\delta > 0$, $R > \bar R_{\bar\rho}(D)$. Fix $\delta > 0$, and let $Q_{XY}$ be a joint source that achieves equality in the definition of $\bar R_{\bar\rho}(D)$. Define the typical set

$G_{n,\delta} = \left\{a^n b^n : \frac{1}{n} i_{X^nY^n}(a^n; b^n) < \bar I(X; Y) + \delta\right\}.$


For every $x > 0$, define further

$A_{n,\delta}(x) = \left\{a^n b^n \in G_{n,\delta} : \frac{1}{n}\rho_n(a^n, b^n) \le x\right\}.$

The properties analogous to P1), P2) are:

P3) $\lim_{n \to \infty} Q_{X^nY^n}(G_{n,\delta}) = 1$.

P4) For every $x > 0$ and every $a^n b^n \in A_{n,\delta}(x)$

$Q_{Y^n|X^n}(b^n | a^n) \le Q_{Y^n}(b^n) \exp\{n(\bar I(X; Y) + \delta)\}.$

Choose $\exp(nR)$ n-blocks independently according to $Q_{Y^n}$ and denote this random set by C. Construct the distribution $V_{X^nY^n}$ as described in the proof of Theorem 1; the distribution $V_{X^nY^n}$ is determined by C. Since $V_{X^nY^n}$ has marginals $P_{X^n}$ and $V_{Y^n}$,

$\bar\rho_n(P_{X^n}, V_{Y^n}) \le E_{V_{X^nY^n}} \frac{1}{n}\rho_n(X^n, Y^n)$

so that, using Lemma 5, it suffices to show that

$E_{C, V_{X^nY^n}} \frac{1}{n}\rho_n(X^n, Y^n) \le D + \delta$

for all sufficiently large n, where $E_{C, V_{X^nY^n}}$ stands for expectation with respect to both $V_{X^nY^n}$ and the sets C. Define

$K_x(a^n, b^n) = \begin{cases} 1, & \text{if } (a^n, b^n) \in A_{n,\delta}(x) \\ 0, & \text{otherwise.} \end{cases}$

The probability of choosing a single element $Y^n$ such that $a^n$ is not within distance x is bounded as

$Q_{Y^n}\left(\frac{1}{n}\rho_n(a^n, Y^n) > x\right) \le Q_{Y^n}((a^n, Y^n) \notin A_{n,\delta}(x)) = 1 - \sum_{b^n \in B^n} Q_{Y^n}(b^n) K_x(a^n, b^n).$

Therefore, using P3), P4) (note that the code C is independent of $X^n$),

$E_C V_{X^nY^n}\left(\frac{1}{n}\rho_n(X^n, Y^n) > x\right) = \sum_{a^n} P_{X^n}(a^n) Q_{Y^n}\left(C : \min_{b^n \in C} \frac{1}{n}\rho_n(a^n, b^n) > x\right)$
$\le 1 - Q_{X^nY^n}(A_{n,\delta}(x)) + \exp_e\{-\exp[n(R - \bar I(X; Y) - \delta)]\}.$

This bound on the probability of distortion higher than x can be used to bound $E_{C, V_{X^nY^n}} \rho_n/n$ as follows:

$E_{C, V_{X^nY^n}} \frac{1}{n}\rho_n(X^n, Y^n) = \int_0^{\rho_{\max}} E_C V_{X^nY^n}\left(\frac{1}{n}\rho_n(X^n, Y^n) > x\right) dx$
$\le \int_0^{\rho_{\max}} Q_{X^nY^n}\left(\frac{1}{n}\rho_n(X^n, Y^n) > x\right) dx + \rho_{\max}\left(Q_{X^nY^n}(G^c_{n,\delta}) + \exp_e\{-\exp[n(R - \bar R_{\bar\rho}(D) - \delta)]\}\right)$
$= E_{Q_{X^nY^n}} \frac{1}{n}\rho_n(X^n, Y^n) + \rho_{\max}\left(Q_{X^nY^n}(G^c_{n,\delta}) + \exp_e\{-\exp[n(R - \bar R_{\bar\rho}(D) - \delta)]\}\right).$

Now, $Q_{XY} \in \mathcal{U}(D)$ and therefore

$\limsup_{n \to \infty} E_{Q_{X^nY^n}} \frac{1}{n}\rho_n(X^n, Y^n) \le D.$

Since $R > \bar R_{\bar\rho}(D)$, one can always pick a positive $\delta$ such that $R - \bar R_{\bar\rho}(D) - \delta > 0$, and the exponential term vanishes. By P3), $Q_{X^nY^n}(G^c_{n,\delta})$ also vanishes as $n \to \infty$. This completes the proof of the direct part.

Converse: The proof of the converse follows the lines of the proofs of the converse parts of Theorems 1 and 2 and is omitted here. $\square$

We conclude this section with a few examples that demonstrate the connections between $S(D, X)$, $S_{\bar\rho}(D, X)$, and $S_p(D, X)$ for simple processes, where these resolvability functions can be computed. Here we concentrate on the computation of these functions; general results concerning the connections between these resolvability functions for stationary and ergodic processes are given in Section VI.

Example 1 ($\bar\rho_S$ and $\bar\rho$ Finite-Precision Resolvability): Let $\rho_n$ be the additive Hamming distance on $A^n$, i.e.

$\rho_n(a^n, b^n) = \sum_{i=1}^n d_H(a_i, b_i)$

where $d_H$ is defined in (26). Let X be an i.i.d. Bernoulli process with parameter $\theta$. In Theorem 10 it is shown that for any process X, $S_{\bar\rho}(D, X)$ equals the minimal achievable block source coding rate at distortion D with respect to the fidelity criterion $\{\rho_n\}_{n \ge 1}$. Therefore, in view of known rate-distortion results for discrete memoryless sources ([3], [17]) we have

$S(D, X) = S_{\bar\rho}(D, X) = \begin{cases} h(\theta) - h(D), & 0 \le D \le \min\{\theta, 1 - \theta\} \\ 0, & D > \min\{\theta, 1 - \theta\} \end{cases} \qquad (18)$

where

$h(\theta) = -\theta \log \theta - (1 - \theta)\log(1 - \theta).$

Example 2: The simplest case where $S(D, X)$ differs from $S_{\bar\rho}(D, X)$ is a compound source. Thus let X be an i.i.d. Bernoulli process with parameter $\theta$, where $\theta$ is a random variable taking values in $\{1/4, 1/2\}$ with a priori probability $P(\theta = 1/2) = p$, and let $\rho_n$ be as in Example 1. Using no more than $h(1/4) - h(D)$ bits per sample we can ensure that the average distortion is no larger than $p/2 + (1 - p)D$. But in order that $\bar\rho_S(P_X, P_Y) \le p/2 + (1 - p)D$, we must use at least $h(1/2) - h(p/2 + (1 - p)D)$ bits per sample. Clearly, for p small enough

$S(p/2 + (1 - p)D, X) \ge h(1/2) - h(p/2 + (1 - p)D) > h(1/4) - h(D) \ge S_{\bar\rho}(p/2 + (1 - p)D, X).$
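A quick numerical check of this inequality chain (my own arithmetic, with hypothetical values p = 0.01 and D = 0.05):

```python
# Numerical check (mine, illustrative) of the Example 2 inequality for the
# compound Bernoulli source with theta in {1/4, 1/2}, P(theta = 1/2) = p.
from math import log2

def h(x):
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

D, p = 0.05, 0.01
lhs = h(0.5) - h(p / 2 + (1 - p) * D)   # lower bound on S(p/2 + (1-p)D, X)
rhs = h(0.25) - h(D)                    # upper bound on S_rhobar(p/2 + (1-p)D, X)
print(lhs, ">", rhs, lhs > rhs)
```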

Example 3 (Prohorov Finite-Precision Resolvability of an i.i.d. Process): Let X be a Bernoulli process with parameter $\theta$, and let $\rho_n$ be the additive Hamming distance, as in Example 1. Since X is information-stable, the objective now is to supremize $\bar H(X|Y)$ over all joint processes $P_{XY}$ having $P_X$ as marginal and satisfying

$P_{X^nY^n}\left\{a^n b^n : \frac{1}{n}\rho_n(a^n, b^n) > D + \epsilon\right\} \le D + \epsilon$

for all sufficiently large n, $\forall \epsilon > 0$.

Define

$E_n = \frac{1}{n}\rho_n(X^n, Y^n).$

Now, it is easy to verify that

$\bar H(X|Y) + \underline H(E|X, Y) \le \bar H(E, X|Y) \le \bar H(X|Y) + \bar H(E|X, Y)$

and since $\bar H(E|X, Y) = \underline H(E|X, Y) = 0$ (this follows from the fact that for every n, $E_n$ is a deterministic function of $X^n$, $Y^n$), we have

$\bar H(E, X|Y) = \bar H(X|Y). \qquad (19)$

On the other hand

$\bar H(X|E, Y) + \underline H(E|Y) \le \bar H(E, X|Y) \le \bar H(X|E, Y) + \bar H(E|Y)$

which implies, by Lemma 2 and the fact that $E_n$ takes only n + 1 values for any n (so that $\bar H(E|Y) = \underline H(E|Y) = 0$), that

$\bar H(E, X|Y) = \bar H(X|E, Y). \qquad (20)$

From (19) and (20) we get

$\bar H(X|Y) = \bar H(X|E, Y). \qquad (21)$

We turn now to upper-bound the supremum of $\bar H(X|E, Y)$ subject to $P_{X^nY^n}(E_n < D + \epsilon) > 1 - D - \epsilon$. Define

$N_\epsilon(b^n) = \left\{a^n : \frac{1}{n}\rho_n(b^n, a^n) < D + \epsilon\right\}.$

It is easy to verify that

$|N_\epsilon(b^n)| \le \exp\{n[h(D + \epsilon) + \epsilon]\}$

for n large enough. This immediately implies that

$\bar H(X|E, Y) \le h(D + \epsilon) + \epsilon$

and this inequality should hold for every $\epsilon > 0$. Thus

$S_p(D, X) \ge h(\theta) - h(D).$

In view of Lemma 1 and (18), we conclude that

$S_p(D, X) = S(D, X) = S_{\bar\rho}(D, X) = \begin{cases} h(\theta) - h(D), & 0 \le D \le \min\{\theta, 1 - \theta\} \\ 0, & D > \min\{\theta, 1 - \theta\}. \end{cases} \qquad (22)$

Thus it is interesting to note that for Bernoulli processes, the finite-precision resolvability is insensitive to whether we measure the accuracy according to Ornstein's $\bar d$ distance or according to the Prohorov distance defined with the additive Hamming distance between sequences. Here we have shown this by using the duality between source coding and resolvability, together with classical rate-distortion results. A much more general result will be shown in Section VI: for any stationary ergodic process the finite-precision resolvability is insensitive to whether we measure accuracy with the $\bar\rho$, $\bar\rho_S$, or Prohorov distance.

Example 4 (Prohorov Finite-Precision Resolvability of a Nonergodic Process): In this example, we compute the Prohorov finite-precision resolvability of a Bernoulli($\Theta$) process where $\Theta$ is a random variable having some a priori distribution. The following result for fixed $\theta$ is needed. It can be viewed as a strong converse for the Prohorov finite-precision resolvability of a Bernoulli($\theta$) process.

Lemma 6: Let X be a Bernoulli($\theta$) process where $\theta$ is a fixed parameter and let $\rho_n$ be the additive Hamming distance on $A^n$, as in Examples 1 and 3. For every $\gamma > 0$ there exists $\epsilon > 0$ such that for every sequence of sets $C_n \subseteq \{0, 1\}^n$, $n \ge 1$, satisfying

$|C_n| \le \exp\{n S_p(D + \gamma, X)\}$

and every sequence of mappings

$\phi_n : \{0, 1\}^n \to C_n$

we have

$\lim_{n \to \infty} P_{X^n}\left(a^n : \frac{1}{n}\rho_n(a^n, \phi_n(a^n)) > D + \epsilon\right) = 1. \qquad (23)$


Proof: By contradiction. Fix $\gamma > 0$, and assume otherwise that for every $\epsilon > 0$ there exists a sequence of sets $C_n$, $n \ge 1$, and corresponding mappings $\phi_n$, such that on some subsequence $n \in J \subseteq \mathbb{N}$

$|C_n| \le \exp\{n S_p(D + \gamma, X)\}$

and, for some $\alpha > 0$

$P_{X^n}\left(a^n : \frac{1}{n}\rho_n(a^n, \phi_n(a^n)) \le D + \epsilon\right) > \alpha, \qquad \forall n \in J.$

For $n \in J$, the set $C_n$ can be viewed as an $(n, |C_n|, D + \epsilon, 1 - \alpha)$ distortion code for X. By Example 3 and the results of Section VI, we know that there exists a sequence of sets $\tilde C_n$ and mappings $\tilde\phi_n$ from $A^n$ to $\tilde C_n$ such that

$|\tilde C_n| \le \exp\{n S_p(D + \gamma, X)\}$

and yet

$\lim_{n \to \infty} P_{X^n}\left(a^n : \frac{1}{n}\rho_n(a^n, \tilde\phi_n(a^n)) > D + \gamma\right) = 0.$

It follows that on the subsequence J we can construct an $(n, M, D')$ average-distortion code for X with

$M \le \exp\{n S_p(D + \gamma, X) + 1\}$

and

$D' \le \alpha(D + \epsilon) + (1 - \alpha)(D + \gamma + \xi_n) \qquad (24)$

where $\xi_n \to 0$. Now, the right-hand side of (24) can be made strictly smaller than $D + \gamma$ by taking $\epsilon$ small (note that $\epsilon$ is arbitrary). This contradicts Example 3, Theorem 10, and classical rate-distortion results. $\square$

Returning to our example, since (23) holds for any mapping $\phi_n$, we conclude that whenever Y is a process with resolution less than or equal to $S_p(D + \gamma, X)$ we have, for some $\epsilon > 0$ depending only on $\gamma$,

$\lim_{n \to \infty} Q_{X^nY^n}\left\{a^n b^n : \frac{1}{n}\rho_n(a^n, b^n) > D + \epsilon\right\} = 1 \qquad (25)$

for any joint process $Q_{XY}$ with marginals X and Y.

Now, let X be a Bernoulli($\Theta$) process where the parameter $\Theta$ is a random variable over a finite set, with a priori distribution $P_\Theta$. Without loss of generality, assume that $P_\Theta(\Theta > 1/2) = 0$. Define

$\theta_D = \min\{\alpha : P_\Theta(\Theta > \alpha) \le D\}.$

We claim that

$S_p(D, X) = h(\theta_D) - h(D).$

To see this, observe that by the definition of $\theta_D$, the probability of having an ergodic component of X with $\Theta > \theta_D$ is less than or equal to D. Moreover, for each ergodic component $X_\theta$ with $\theta \le \theta_D$ we can construct a simulator of complexity no larger than $h(\theta) - h(D) \le h(\theta_D) - h(D)$ such that the probability of having a distortion larger than D vanishes with the blocklength (this follows from Example 3). Hence we have $S_p(D, X) \le h(\theta_D) - h(D)$. The other direction follows from (25).
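A minimal sketch of this computation (my own; the prior on $\Theta$ below is a hypothetical example):

```python
# Sketch (my own numbers) of Example 4: Prohorov resolvability of a mixed
# Bernoulli source with a finite prior on the parameter Theta.
from math import log2

def h(x):
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

prior = {0.1: 0.5, 0.3: 0.3, 0.45: 0.2}   # hypothetical P_Theta
D = 0.25

def theta_D(prior, D):
    # smallest alpha with P(Theta > alpha) <= D; it suffices to search the support
    for alpha in sorted(prior):
        if sum(p for t, p in prior.items() if t > alpha) <= D:
            return alpha
    return max(prior)

tD = theta_D(prior, D)
print("theta_D =", tD, " S_p(D, X) =", max(h(tD) - h(D), 0.0))
```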

IV. APPROXIMATION IN VARIATIONAL DISTANCE

The finite-precision resolvability of a source with respect to variational distance can be expressed as a special case of finite-precision resolvability with respect to $\bar\rho$ by using a simple and useful relation between the variational distance and the $\bar d$ distance between two distributions defined on the same set. The $\bar d$ distance between random variables X and $\tilde X$ with distributions P and $\tilde P$, respectively, is defined as

$\bar d(P, \tilde P) = \min_{Q \in \mathcal{P}(P, \tilde P)} E_Q[d_H(X, \tilde X)]$

where

$d_H(a, \tilde a) = \begin{cases} 0, & a = \tilde a \\ 1, & \text{otherwise.} \end{cases} \qquad (26)$

The following key identity holds [16]:

$d_v(\tilde P, P) = 2\bar d(P, \tilde P). \qquad (27)$
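Identity (27) can be checked numerically by solving the optimal-coupling linear program for single letters; the sketch below (mine, not from the paper) confirms that the minimal probability of disagreement under a coupling equals $d_v/2$:

```python
# Numerical check (mine) of identity (27): the minimal probability that a
# coupling of P and P~ disagrees equals half the variational distance.
import numpy as np
from scipy.optimize import linprog

P = np.array([0.5, 0.3, 0.2])
Pt = np.array([0.4, 0.4, 0.2])
m = len(P)
cost = (1 - np.eye(m)).ravel()          # single-letter Hamming cost
A = np.zeros((2 * m, m * m))
for i in range(m):
    A[i, i * m:(i + 1) * m] = 1         # row marginal = P
    A[m + i, i::m] = 1                  # column marginal = Pt
res = linprog(cost, A_eq=A, b_eq=np.concatenate([P, Pt]),
              bounds=(0, None), method="highs")
print(res.fun, np.abs(P - Pt).sum() / 2)   # the two numbers agree
```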

Note that in Theorem 3 we did not impose any structure on the sequence of distortion measures $\{\rho_n\}$ besides the boundedness of $\rho_n/n$. Therefore, $S_v(D, X)$ is a special case of $S_{\bar\rho}(D, X)$ with $\rho_n(a^n, b^n) = n\, d_H(a^n, b^n)$. We state this as a corollary.

Corollary 1:

$S_v(D, X) = \inf_{Q \in \mathcal{W}(D)} \bar I(X; Y)$

where $\mathcal{W}(D)$ is the class of all joint processes $Q_{XY}$ having $P_X$ as marginal, such that

$\limsup_{n \to \infty} Q_{X^nY^n}(X^n \ne Y^n) \le \frac{D}{2}.$

The minimization of the sup-information rate that has to be carried out in order to compute the resolvability functions is not an easy task. As has already been observed in rate-distortion problems, the minimization can be difficult even when we deal with single-letter average mutual information. The next result shows that the variational resolvability admits a much simpler characterization, as the inverse of the limit of one minus the cumulative distribution function of the normalized self-information random variable $\frac{1}{n}\log\frac{1}{P_{X^n}(X^n)}$. We first record the definition of this function.

Definition 17: $\bar R_v(D)$ is defined as the infimum over all real numbers h satisfying

$\limsup_{n \to \infty} P_{X^n}\left\{a^n : \frac{1}{n}\log\frac{1}{P_{X^n}(a^n)} > h\right\} \le \frac{D}{2}.$

That is, it is the smallest real number h such that the mass of the entropy density to the right of h does not exceed D/2, asymptotically.

The function $\bar R_v(D)$ has an operational meaning in fixed-length source coding, as shown in Corollary 3 of Section VI. The main result concerning variational resolvability is the following.
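Since the definition involves only the distribution of the normalized entropy density, $\bar R_v(D)$ can be estimated by Monte Carlo for simple sources. A sketch (my own; the Bernoulli(0.2) source and the sample sizes are arbitrary choices):

```python
# Monte Carlo sketch (mine) of Definition 17 for an i.i.d. Bernoulli(0.2)
# source: the normalized entropy density concentrates at h(0.2), so the
# empirical (1 - D/2)-quantile, a stand-in for R_v(D), approaches h(0.2).
import numpy as np

rng = np.random.default_rng(1)
theta, n, samples = 0.2, 2000, 20000
x = rng.random((samples, n)) < theta
ones = x.sum(axis=1)
# (1/n) log2 1/P(X^n): for an i.i.d. source it depends on X^n only through
# the number of ones
density = -(ones * np.log2(theta) + (n - ones) * np.log2(1 - theta)) / n
for D in (0.1, 0.5, 1.0):
    print(D, np.quantile(density, 1 - D / 2))
```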


Theorem 4: For any process X

$S_v(D, X) = \bar R_v(D).$

The proof of the converse part of Theorem 4 relies on the following lemma, which gives a new lower bound on the variational distance between two distributions in terms of the distribution of the log-likelihood. An upper bound in terms of that distribution was shown in [1].

Lemma 7: For any two distributions P, Q on A and every $\alpha > 0$

$\frac{1}{2} d_v(P, Q) \ge P\{a : P(a) \le Q(a)\exp(-\alpha)\} - \exp(-\alpha). \qquad (28)$

Proof: Let $C \subseteq A$ be the support of Q and let

$L = \{a : P(a) \le Q(a)\exp(-\alpha)\}. \qquad (29)$

Then

$\frac{1}{2} d_v(P, Q) = \max_{A' \subseteq A}[Q(A') - P(A')] \ge Q(C) - P(C) = 1 - P(C) \ge 1 - P(L^c) - P(C \cap L) \ge P(L) - \exp(-\alpha)$

where the first inequality follows by choosing $A' = C$, and the last one holds since, by (29), $P(C \cap L) \le \exp(-\alpha)\sum_{a \in C} Q(a) = \exp(-\alpha)$. This proves the lemma. $\square$

Observe that if Q is generated by a deterministic mapping of a random variable uniformly distributed on M elements, then on the support of Q we have $Q(a) \ge M^{-1}$, and Lemma 7 implies

$\frac{1}{2} d_v(P, Q) \ge P\left\{a : P(a) \le \frac{\exp(-\alpha)}{M}\right\} - \exp(-\alpha). \qquad (30)$

Proof of Theorem 4: We start with the converse. We will show that whenever we use fewer than $\bar R_v(D)$ random bits per symbol, the resulting distance is larger than D. Indeed, recall that $\bar R_v(D)$ is the infimum over all reals h meeting the constraint

$\limsup_{n \to \infty} P_{X^n}\left(\frac{1}{n}\log\frac{1}{P_{X^n}(X^n)} > h\right) \le \frac{D}{2}.$

Hence, for every $\gamma > 0$

$P_{X^n}\left(\frac{1}{n}\log\frac{1}{P_{X^n}(X^n)} > \bar R_v(D) - \gamma\right) > \frac{D}{2} \qquad (31)$

infinitely often in n. Let $Q_{Y^n}$ be a uniform distribution on a set of size $\exp(n\bar R_v(D) - 2n\gamma)$. Upon setting $\alpha = n\gamma$ and

$M = \exp(n\bar R_v(D) - 2n\gamma)$

in (30), and by (31),

$d_v(X^n, f_n(Y^n)) \ge 2\left[P_{X^n}\left(\frac{1}{n}\log\frac{1}{P_{X^n}(X^n)} > \bar R_v(D) - \gamma\right) - \exp(-n\gamma)\right] > D$

infinitely often in n, for any deterministic mapping $f_n$ of the uniform random variable, proving the converse part.

We proceed to prove the direct part. We will show that for any $\gamma > 0$, $\bar R_v(D) + \gamma$ is a D-achievable resolution rate of X. Construct a process Y with alphabet A as follows:

$P_{Y^n}(a^n) = \begin{cases} C_n P_{X^n}(a^n), & a^n \in T_n \\ 0, & \text{otherwise} \end{cases}$

where $C_n$ is a normalization factor, i.e., $C_n = 1/P_{X^n}(T_n)$, and

$T_n = \left\{a^n : \frac{1}{n}\log\frac{1}{P_{X^n}(a^n)} \le \bar R_v(D) + \gamma\right\}.$

Now, the resulting variational distance is

$d_v(X^n, Y^n) = (C_n - 1) P_{X^n}(T_n) + P_{X^n}(T_n^c) = 2 P_{X^n}(T_n^c)$

implying, by the definition of $\bar R_v(D)$, that

$\limsup_{n \to \infty} d_v(X^n, Y^n) \le D. \qquad (32)$

Hence, Y is an approximating process for X. How many random bits are required to construct Y? Observe that for every n, $Y^n$ is a vector over a (super) alphabet of size at most $\exp(n\bar R_v(D) + n\gamma)$; therefore it can be approximated by a process $\tilde Y$ such that

$\frac{1}{n} R(P_{\tilde Y^n}) \le \bar R_v(D) + \gamma \qquad (33)$

and

$\lim_{n \to \infty} d_v(Y^n, \tilde Y^n) = 0. \qquad (34)$

By (32) and (34)

$\limsup_{n \to \infty} d_v(X^n, \tilde Y^n) \le D$

which, with (33), proves the direct part. $\square$


Example 5 - Variational Finite-Precision Resolvability of an Information-Stable Source: If X is information-stable in the sense that H̲(X) = H̄(X) (e.g., a stationary ergodic process), then its entropy rate exists, H̄(X) = H(X) [1], and it is equal to the variational finite-precision resolvability

S_v(D, X) = H(X)    (35)

for all D < 2. This means that any process with entropy strictly less than H(X) fails to approximate X no matter how coarse the allowed variational distance. (The extreme case S_v(2, X) = 0 is trivial.) In order to show (35), the reader can use the characterization in Corollary 1, or we can simply observe that the normalized entropy density converges in probability to H(X), and thus, according to Definition 17,

R̄_v(D) = H(X) for 0 ≤ D < 2.

V. RESOLVABILITY WITH RESPECT TO DIVERGENCE

In this section we study the problem of finite-precision simulation of a process when the accuracy measure is the normalized divergence (Definition 6). It is not known whether it is possible to express divergence as a Kantorovich-Vasershtein distance, so the methods of this section will be different from those used in Sections III and IV. The Prohorov and Ornstein measures defined with an appropriate sample-path distortion function are suitable approximation measures for real-valued random processes. However, we call attention to the fact that the divergence between any approximating distribution with finite randomness and the original real-valued random process is infinite. Thus the accuracy measure in this section is specifically tailored to the simulation of discrete processes.

Key conclusions of this section will be that the following two different methods by which a distribution can be derandomized are optimal as far as finite-precision simulation subject to a divergence constraint (both are illustrated in the sketch after this list):

a) raise all the probability masses to a power α greater than 1 (and normalize);

b) keep only the M most likely probability masses (and normalize).
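For concreteness, the following sketch (our own; the function names are not from the paper) implements the two operations on a finite pmf, together with the entropy and divergence they trade off.

import numpy as np

def tilt(p, alpha):
    """Method a): raise masses to the power alpha > 1 and normalize."""
    q = np.power(np.asarray(p, float), alpha)
    return q / q.sum()

def truncate(p, M):
    """Method b): keep the M most likely masses and normalize."""
    p = np.asarray(p, float)
    q = np.zeros_like(p)
    top = np.argsort(p)[-M:]          # indices of the M largest masses
    q[top] = p[top]
    return q / q.sum()

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def divergence(q, p):
    """D(q || p) in bits; support of q must lie inside support of p."""
    mask = q > 0
    return (q[mask] * np.log2(q[mask] / np.asarray(p, float)[mask])).sum()

p = np.array([0.4, 0.3, 0.2, 0.1])
for q in (tilt(p, 2.0), truncate(p, 3)):
    print(entropy(q), divergence(q, p))   # entropy drops; divergence is the price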

In addition to the resolvability S_d(D, X) defined as in Definition 11-3 (with r_n equal to normalized divergence), it is very convenient and appropriate to use the slightly different notion of complexity of random process simulation called mean-resolvability, which was introduced in [1]. The definition is identical except that the resolution in (1) is replaced by the entropy. The rationale is that this gives the fundamental limit of simulation complexity in the sense of average number of random bits (cf. [1], [18]). As will be shown, both notions of resolvability lead to the same result for a large class of random processes of interest.

A. Mean-Resolvability for General Processes

In [18], Knuth and Yao introduced the notion of average complexity of random variable generation. They showed that the minimal average number of random bits required to exactly synthesize a random variable X lies between H(X) and H(X) + 2. As explained in [1], this suggests considering the normalized entropy of n-tuples as a measure of randomness. Specifically, if in Definition 2 we replace (1) by the requirement

(1/n) H(Y^n) ≤ R + γ

then achievable resolution rates become achievable entropy rates. We denote by S̄_d(D, X) the minimum D-achievable entropy rate with respect to divergence.

In this section we shall also take a dual divergence-rate approach. The basic definitions follow.

Definition 18: D is an S-achievable distortion for a process X if for every γ > 0 there exists P_Y such that for all sufficiently large n

(1/n) H(Y^n) ≤ S + γ    (36)

and

limsup_{n→∞} (1/n) D(P_{Y^n} ‖ P_{X^n}) ≤ D.

Definition 19: We denote by D̄(S) the infimum over all S-achievable distortions for X.

The following theorem is a direct consequence of our asymptotic definitions and the results of [18].

Theorem 5:

a)

S̄_d(D, X) = limsup_{n→∞} (1/n) inf_{Q_{Y^n} ∈ 𝒳_n(D)} H(Y^n)

where 𝒳_n(D) is the class of distributions Q_{Y^n} satisfying

(1/n) D(Q_{Y^n} ‖ P_{X^n}) ≤ D.

Unlike classical rate-distortion theory, it is possible to solve the optimization problem that characterizes S̄_d(D, X) pointwise in n. As we shall see in the sequel, the technical difficulties arise in taking the limit.

For fixed n, we have to minimize a concave function (entropy) over the convex set of probability mass functions satisfying the divergence constraint. Since

H(Y^n) = Σ_{a^n} Q_{Y^n}(a^n) log (1/P_{X^n}(a^n)) − D(Q_{Y^n} ‖ P_{X^n})

as long as the minimizing probability mass function achieves the divergence bound, the constrained optimization problem in part a) of Theorem 5 is equivalent to the following problem:

minimize Σ_{a^n} Q_{Y^n}(a^n) log (1/P_{X^n}(a^n))  subject to  (1/n) D(Q_{Y^n} ‖ P_{X^n}) ≤ D    (37)

where now we minimize a convex function over a convex region.

In this formulation, the case where P_{X^n} is an equiprobable distribution has to be treated separately, since then (37) is independent of Q_{Y^n}. However, the solution in this case is immediate: if P_{X^n} is equiprobable on M elements, then

min_{(1/n) D(Q_{Y^n} ‖ P_{X^n}) ≤ D} (1/n) H(Y^n) = (1/n) log M − D

and moreover, the optimal Q is an equiprobable distribution on a subset of the support of P_{X^n}, of size M exp(−nD).

For notational convenience, from this point on we drop the dependence on n. We denote by {P_j} the original distribution and by {Q*_j} the minimizing distribution. Using Lagrange multipliers and applying the Kuhn-Tucker conditions, we obtain that the minimizing distribution Q* satisfies

log (1/P_j) + λ_1 log (Q*_j/P_j) + λ_1 + λ_2 ≥ 0  ∀ j

with equality if Q*_j > 0, or

Q*_j = exp(−1 − λ_2/λ_1) P_j^{(1+λ_1)/λ_1}.

Therefore, the minimizing Q* has the form

Q*_j = P_j^α / Σ_k P_k^α,  α ≥ 1.    (38)

For a general process X, the value of α that results in a given divergence D (i.e., a given divergence between Q*_{Y^n} and P_{X^n}) may depend on n. Moreover, it is possible to construct pathological examples where the original process X is information-stable but α_n → ∞ and the minimizing Q* is not information-stable. Therefore, although the minimization in Theorem 5 can be solved explicitly for every n, computation of the limit can be a difficult task. However, much can be deduced from (38) about the solution for special cases like i.i.d. or Markov processes. This will be done in the next subsections.
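At a fixed blocklength, the tilting exponent α in (38) that meets a prescribed divergence budget can be found numerically, since D(Q_α ‖ P) is nondecreasing in α ≥ 1 along the tilted family. A bisection sketch (ours) under that assumption:

import numpy as np

def tilt(p, alpha):
    q = np.power(p, alpha)
    return q / q.sum()

def dkl(q, p):
    m = q > 0
    return (q[m] * np.log(q[m] / p[m])).sum()   # nats

def solve_alpha(p, D, lo=1.0, hi=64.0, tol=1e-10):
    """Bisect on alpha >= 1 so that D(Q_alpha || P) = D (in nats)."""
    p = np.asarray(p, float)
    if dkl(tilt(p, hi), p) < D:
        return hi   # requested divergence beyond what tilting up to hi reaches
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dkl(tilt(p, mid), p) < D:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p = np.array([0.4, 0.3, 0.2, 0.1])
alpha = solve_alpha(p, D=0.05)
print(alpha, dkl(tilt(p, alpha), p))   # alpha > 1 hitting the divergence budget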

The characterization given in (37) is used in the next lemma to prove convexity (and hence continuity) of S̄_d(D, X).

Lemma 8: For any source X, S̄_d(D, X) is convex in D.

Proof: The assertion is immediate for the uniform case. For a general distribution, use the characterization (37): let Q and Q′ attain the minimum in (37) at distortion levels D and D′, respectively, and let 0 ≤ λ ≤ 1. By convexity of divergence,

(1/n) D(λQ + (1 − λ)Q′ ‖ P_{X^n}) ≤ λD + (1 − λ)D′

so λQ + (1 − λ)Q′ is feasible at distortion λD + (1 − λ)D′, while the linear objective satisfies

Σ_{a^n} [λQ + (1 − λ)Q′](a^n) log (1/P_{X^n}(a^n)) = λ Σ_{a^n} Q(a^n) log (1/P_{X^n}(a^n)) + (1 − λ) Σ_{a^n} Q′(a^n) log (1/P_{X^n}(a^n))

where we have used convexity of divergence in the inequality above. This shows that S̄_d(D, X) is a limit of convex functions, and hence convex. □

The continuity of S̄_d(D, X) implies that the rate-divergence and divergence-rate approaches yield the same characterization. This is stated in the next theorem.

Theorem 6: For any process X

S̄_d(D, X) = min { S : D̄(S) ≤ D }.

B. Mean-Resolvability for Stationary Memoryless Sources

The solution given by (38) leads immediately to the finite-precision mean-resolvability of an i.i.d. source.

Theorem 7: The finite-precision mean-resolvability S̄_d(D, X) of an i.i.d. source with distribution P on a finite alphabet A is equal to log |A| − D if P is equiprobable, and otherwise it is given by the following parametric form:

S̄_d(D, X) = α Σ_j q_j log (1/p_j) + (1 − α) R_α(P)    (39)

D = D(Q_α ‖ P),  q_j = p_j^α / Σ_k p_k^α    (40)

where R_α(P) is the Rényi entropy of order α

R_α(P) = (1/(1 − α)) log Σ_j p_j^α    (41)

and α ranges from 1 to ∞.

Proof: Follows immediately from (38) upon noticing that Q* is a product distribution if P is a product distribution. □
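For the record, the one-step computation behind (39) (our restatement of the algebra, with Z = Σ_k p_k^α):

\begin{aligned}
H(Q_\alpha) &= -\sum_j q_j \log \frac{p_j^\alpha}{Z}
             = \alpha \sum_j q_j \log \frac{1}{p_j} + \log Z \\
            &= \alpha \sum_j q_j \log \frac{1}{p_j} + (1-\alpha)\, R_\alpha(P),
\end{aligned}
\qquad \text{since } \log Z = (1-\alpha) R_\alpha(P).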

Example 6 - Finite-Precision Divergence Mean-Resolvability of a Bernoulli(p) Source: In the case of a Bernoulli source, the solution in Theorem 7 particularizes to the parametric form

S̄_d(D, X) = h(q_α) = α [ q_α log (1/p) + (1 − q_α) log (1/(1 − p)) ] + (1 − α) r_α(p)

D = q_α log (q_α/p) + (1 − q_α) log ((1 − q_α)/(1 − p)),  q_α = p^α / (p^α + (1 − p)^α)

where h(·) and r_α(·) are the binary entropy and binary Rényi entropy of order α, respectively. This parametric solution is plotted in Fig. 2 for various values of p.
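The curves of Fig. 2 can be reproduced qualitatively from this parametric solution; a short sketch (ours), sweeping α for the values of p in the figure:

import numpy as np

def bernoulli_curve(p, alphas):
    """Parametric (divergence, mean-resolvability) points for a Bernoulli(p)
    source, from the tilted solution q_alpha = p^a / (p^a + (1-p)^a)."""
    pts = []
    for a in alphas:
        q = p**a / (p**a + (1 - p)**a)
        D = q * np.log2(q / p) + (1 - q) * np.log2((1 - q) / (1 - p))
        S = -(q * np.log2(q) + (1 - q) * np.log2(1 - q))   # h(q_alpha), bits
        pts.append((D, S))
    return pts

for p in (0.1, 0.3, 0.47):   # the values plotted in Fig. 2
    print(p, bernoulli_curve(p, alphas=[1.0, 2.0, 4.0, 8.0])[:2])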

C. Mean-Resolvability for Markov Sources

In this subsection we examine the solution of the optimization problem for the special case where X is a homogeneous stationary Markov chain. Set

P_{X^n}(a^n) = P(a_1) Π_{i=2}^{n} W(a_i | a_{i−1})

where W is an irreducible Markov transition matrix and P is its stationary distribution. Then the optimization problem in (37) particularizes to

inf_{Q^n} (1/n) G(Q^n)    (42)


Fig. 2. Finite-precision resolvability (vertical axis) versus normalized divergence per sample (horizontal axis) for a Bernoulli-p source, for p = 0.1, 0.3, 0.47, from bottom to top.

where the infimum is over the class of Q^n satisfying

G(Q^n) − nD ≤ H(Q^n)    (43)

and

G(Q^n) = Σ_{a^n} Q^n(a^n) log (1/P_{X^n}(a^n)).

We claim that if Q*^n achieves the infimum in (42), then so does its Markov approximation

Q̃^n(a^n) = Q*(a_1) Π_{i=2}^{n} Q*(a_i | a_{i−1}).

To see this, note that if Q̃^n is the Markov approximation of Q^n, then G(Q̃^n) = G(Q^n) and

H(Q̃^n) ≥ H(Q^n).

Therefore, in the solution of (42), (43) we can restrict attention to Markov processes. With Q Markov, it is evident that as n → ∞, (42), (43) become

min Σ_a Q(a) Σ_b Q(b | a) log (1/W(b | a))    (44)

where the minimum is over Q satisfying a constraint on conditional divergence

Σ_a Q(a) D(Q(· | a) ‖ W(· | a)) ≤ D    (45)

and the optimal approximating process is a stationary Markov chain whose transition matrix is given by the minimizing Q. Next, one can minimize (44) under the constraint (45) by minimizing each of the inner sums in (44) under suitable constraints:

min_{Q(· | a)} Σ_b Q(b | a) log (1/W(b | a))    (46)

subject to

D(Q(· | a) ‖ W(· | a)) ≤ D_a    (47)

Σ_a Q(a) D_a ≤ D    (48)

where Q(a) is the stationary distribution of the transition matrix Q(· | ·).    (49)

For a general transition matrix W, this minimization problem appears to be challenging since we have to optimize over all assignments {D_a} and the optimization problems (46)-(48) are coupled through (49). Note that we are not free to choose the distribution Q(a) in (48): it has to be the steady-state distribution of the optimal transition Q(a_i | a_{i−1}), as indicated by (49).

While a closed-form parametric solution to (46)-(49) may be hard to obtain, upper and lower bounds on the mean-resolvability function can be derived. First note that if we suboptimally assign the same distortion to all letters, D_a = D for every a ∈ A, then (48) is satisfied for any steady-state distribution Q(a), a ∈ A, thus decoupling (46) and (47) from (49). For any a ∈ A, denote by S̄_d(D, W(· | a)) the mean-resolvability function of an i.i.d. process whose distribution is given by W(· | a) (see Section V-B). Then the mean-resolvability of the Markov process X with transition matrix W is upper-bounded by

S̄_d(D, X) ≤ Σ_a Q*(a) S̄_d(D, W(· | a))    (50)
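A rough numerical illustration of the bound (50) follows (a sketch under two simplifications flagged in the comments: D_a = D for every letter, and the stationary distribution of W standing in for Q*(a); equiprobable rows, which require the separate log |A| − D formula of Theorem 7, are not handled).

import numpy as np

def tilt(p, a):
    q = np.power(p, a); return q / q.sum()

def dkl(q, p):
    m = q > 0; return (q[m] * np.log(q[m] / p[m])).sum()

def ent(q):
    q = q[q > 0]; return -(q * np.log(q)).sum()

def sd_iid(p, D):
    """Mean-resolvability of an i.i.d. source with pmf p at divergence D
    (nats), via bisection on the tilting exponent alpha of (38)."""
    lo, hi = 1.0, 64.0
    if dkl(tilt(p, hi), p) < D:
        return ent(tilt(p, hi))
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if dkl(tilt(p, mid), p) < D else (lo, mid)
    return ent(tilt(p, lo))

def markov_upper_bound(W, D):
    """Right-hand side of (50), with the stationary distribution of W
    standing in for Q*(a) -- a further simplification on top of D_a = D."""
    evals, evecs = np.linalg.eig(W.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = pi / pi.sum()
    return sum(pi[a] * sd_iid(W[a], D) for a in range(len(W)))

W = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(markov_upper_bound(W, D=0.01))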


where Q*(a) is the stationary distribution of the transition matrix Q* with rows Q*(· | b) ∝ W(· | b)^{α_b}, and α_b is chosen so that Q*(· | b) achieves S̄_d(D, W(· | b)). Similarly, we can get a lower bound on the mean-resolvability function.

A useful conclusion that can be drawn from (44), (45) is that Q*(a | b) = 0 if and only if W(a | b) = 0. Therefore, Q* preserves the irreducibility of W. In the next subsection we will make use of this observation to conclude that for Markov chains, mean-resolvability equals resolvability.

D. Equality of Resolvability and Mean-Resolvability

Reference [1] showed that infinite-precision mean-resolvability and resolvability are equal for stationary ergodic sources. In this subsection, we show a class of sources for which that property holds in the more general setting of finite-precision resolvability.

Theorem 8: Let P_X be information-stable, and assume that there exists an information-stable process Q_Y such that

H(Y) = S̄_d(D, X)

and

limsup_{n→∞} (1/n) D(Q_{Y^n} ‖ P_{X^n}) ≤ D.

Then S_d(D, X) = S̄_d(D, X).

Corollary 2: For any irreducible Markov chain (in particular, for any memoryless source), mean-resolvability equals resolvability.

Proof: As we saw in Section V-C, the optimal approximating process for a homogeneous stationary Markov process is a Markov process, which is information-stable. □

The proof of Theorem 8 is based on the following lemma, whose proof can be found in Appendix II.

Lemma 9: Let P_X, P_Z be information-stable sources with entropy rates H(X), H(Z), respectively, and assume that

H(Z) < H(X).

Then there exists a process V whose resolution satisfies

limsup_{n→∞} (1/n) R(V^n) ≤ H(Z)

and

limsup_{n→∞} (1/n) D(P_{V^n} ‖ P_{X^n}) ≤ limsup_{n→∞} (1/n) D(P_{Z^n} ‖ P_{X^n}).

Proof of Theorem 8: By definition of mean-resolvability and resolvability, S̄_d(D, X) ≤ S_d(D, X), and the theorem follows from Lemma 9. □

E. Divergence Resolvability and Data Compression

In this subsection we show a connection between resolvability subject to a divergence constraint and sub-entropic data compression. In optimal data compression below the source entropy it is of interest to find the rate at which the probability of error goes to 1 for a given coding rate R. In the case of a memoryless source, it is shown in [3, p. 41] that such a function is given by the function D̄(R) defined in Section V-A as the dual of mean-resolvability with respect to divergence. (Note that in the memoryless case, D̄(R) has a single-letter characterization.) By means of the result of Section V-D, this is a consequence of a special case of the general result shown in this subsection: the resolvability with respect to divergence is the dual of the maximum exponent of correct-decoding probability. We first record the definition of the correct-decoding exponent.

Definition 20: Let f_n, g_n be an encoder-decoder pair, and denote by e(f_n, g_n) its probability of error. Let |f_n| stand for the codebook size. The correct-decoding exponent C(S) is defined as

C(S) = inf limsup_{n→∞} (1/n) log ( 1/(1 − e(f_n, g_n)) )

where the infimum is taken over all sequences {f_n, g_n}_{n≥1} satisfying

limsup_{n→∞} (1/n) log |f_n| ≤ S.

In addition to the characterization of S_d(D, X), we shall take a divergence-rate approach, as given in the following definitions.

Definition 21: D is an S-achievable distortion for a process X if for every γ > 0 there exists P_Y such that for all sufficiently large n

(1/n) R(Y^n) ≤ S + γ    (51)

and

limsup_{n→∞} (1/n) D(P_{Y^n} ‖ P_{X^n}) ≤ D.    (52)

Definition 22: We denote by D_d(S, X) the infimum over all S-achievable distortions for X.

Recall that, for any distribution P [1]

H(P) ≤ R_0(P) ≤ R(P)    (53)

where R_0(P) is the Rényi entropy of order 0, which is equal to the logarithm of the support-size of P (see (41)). In view of (53), to show the connection between S_d(D, X), S̄_d(D, X) and the correct-decoding exponent C(S), it will be most convenient to introduce a new kind of resolvability, where the measure of simulation complexity is the Rényi entropy of order 0. The pertinent definitions follow.

Definition 23: The zeroth-order Rényi entropy rate R̄_0(Q_Y) of a process Q_Y is defined as

R̄_0(Q_Y) = limsup_{n→∞} (1/n) log |Y^n|

where |Y^n| is the support size of Q_{Y^n}.


For a pair of processes Q_Y, Q_X, we shall denote

D̄(Q_Y ‖ Q_X) = limsup_{n→∞} (1/n) D(Q_{Y^n} ‖ Q_{X^n}).

Definition 24: The function Ŝ_d(D, X) is defined as

Ŝ_d(D, X) = inf_{Q_Y ∈ 𝒳(D)} R̄_0(Q_Y)

where 𝒳(D) is the class of all processes Q_Y satisfying

D̄(Q_Y ‖ P_X) ≤ D.

Set

𝒯(S) = { Q_Y : R̄_0(Q_Y) ≤ S }.

The divergence-rate dual of Ŝ_d(D, X) is defined next.

Definition 25: The function D̂_d(S, X) is defined as

D̂_d(S, X) = inf_{Q_Y ∈ 𝒯(S)} D̄(Q_Y ‖ P_X).

The following inequalities are a direct consequence of our definitions and of (53):

S̄_d(D, X) ≤ Ŝ_d(D, X) ≤ S_d(D, X).    (54)

A simple concept that plays an important role in this subsection is the following.

Definition 26: A subset C ⊆ A is called a set of maximal probability according to P if it satisfies the following property: if a ∈ C and P(a′) > P(a), then a′ ∈ C.

In fixed-length source coding at rate R, the optimal coding strategy is to assign distinct codewords only to the elements of the maximal probability set of size exp(nR). In the event of error-free encoding, the distribution of the decoder output is given by the original distribution P_{X^n} conditioned on a set of maximal probability. This gives rise to the following definition.

Definition 27: For a given distribution P and maximal probability set C, the code-derandomized distribution P̄ is defined as P conditioned on C, i.e.

P̄(a) = P(a)/P(C) if a ∈ C, and 0 otherwise.

The reader can easily verify that

H(P̄) ≤ H(P)
R_0(P̄) ≤ R_0(P)
R(P̄) ≤ R(P).
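The first two inequalities are easy to confirm numerically; a quick sketch (ours; the function name is hypothetical, and the resolution R(·) is omitted since it depends on the synthesis alphabet size):

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def code_derandomize(p, M):
    """P conditioned on its maximal probability set of size M."""
    p = np.asarray(p, float)
    top = np.argsort(p)[-M:]
    q = np.zeros_like(p)
    q[top] = p[top] / p[top].sum()
    return q

p = np.array([0.4, 0.25, 0.2, 0.1, 0.05])
pb = code_derandomize(p, M=3)
assert entropy(pb) <= entropy(p)          # H(Pbar) <= H(P)
assert (pb > 0).sum() <= (p > 0).sum()    # R0(Pbar) <= R0(P)
print(entropy(p), entropy(pb))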

We are finally ready to state the main result of this section.

Theorem 9:

a) Ŝ_d(D, X) = S_d(D, X).
b) D_d(S, X) = D̂_d(S, X).
c) D_d(S, X) and S_d(D, X) are achieved by code-derandomized distributions.
d) D_d(S, X) = C(S).

Proof: We start with part a). The inequality S_d(D, X) ≥ Ŝ_d(D, X) is immediate from the fact that any mapping of nS random bits results in a distribution on an alphabet of size exp(nS), where some of the letters may have zero mass.

As for the reverse inequality

S_d(D, X) ≤ Ŝ_d(D, X)    (55)

for every ε, γ > 0, we will construct a process V such that for all sufficiently large n

(1/n) R(V^n) ≤ Ŝ_d(D, X) + γ

and

(1/n) D(P_{V^n} ‖ P_{X^n}) ≤ D + ε.

To this end, we first derive lower bounds on (1/n) D(Q_{Y^n} ‖ P_{X^n}) for arbitrary Q_{Y^n}, n ≥ 1, and then construct a simulation scheme that achieves these bounds asymptotically in n.

Let C_n ⊆ A^n be an arbitrary set and let Q_{Y^n} be an arbitrary distribution whose support is C_n. Then

(1/n) D(Q_{Y^n} ‖ P_{X^n}) ≥ (1/n) log (1/P_{X^n}(C_n))    (56)

where we have used the log-sum inequality. This gives a lower bound on (1/n) D(Q_{Y^n} ‖ P_{X^n}) in terms of the probability of the support of Q_{Y^n} according to P_{X^n}. If we choose Q_{Y^n} of the form

Q_{Y^n}(a^n) = P_{X^n}(a^n)/P_{X^n}(C_n) for a^n ∈ C_n, and 0 otherwise    (57)

then

(1/n) D(Q_{Y^n} ‖ P_{X^n}) = (1/n) log (1/P_{X^n}(C_n)).

Therefore, for a given support set C_n, the lower bound in (56) is achievable with the distribution defined in (57). Henceforth we can concentrate only on those processes Y for which, for every n, the support C_n of Q_{Y^n} is a maximal probability set according to P_{X^n}, or equivalently, Q_{Y^n} is a code-derandomized distribution.

The next step is to show that the lower bound (56) is achievable not only by the distribution (57), but also by a distribution whose resolution is close to log |C_n|. For convenience, set

|C_n| = exp(nS).    (58)


First note that for C_n

(1/n) log (1/P_{X^n}(C_n)) ≤ log |A|    (59)

since otherwise one arrives at the conclusion that, infinitely often in n,

P_{X^n}(a^n) < |A|^{−n}  ∀ a^n ∈ A^n.

We are now going to construct an approximating distribution that "looks like" (57), and can be exactly synthesized by nS + nγ bits, for small γ. Although C_n is a set of maximal probability, it can have elements whose probability according to P_{X^n} is exponentially smaller than P_{X^n}(C_n) exp(−nS) (and therefore their probability according to P_{Y^n} is exponentially smaller than exp(−nS)). Of course, we would not like to approximate these probabilities since this would require too many bits. Therefore, the first thing to do is to isolate those points of C_n having too low a probability.

Choose γ > δ > 0 arbitrarily small. Define

F_n = { a^n ∈ C_n : P_{X^n}(a^n) ≥ P_{X^n}(C_n) exp(−nS − nδ) }.

Then

P_{X^n}(C_n) = P_{X^n}(F_n) + Σ_{C_n ∖ F_n} P_{X^n}(a^n) < P_{X^n}(F_n) + exp(nS) exp(−nS − nδ) P_{X^n}(C_n).

Hence

P_{X^n}(F_n) > P_{X^n}(C_n) (1 − exp(−nδ))    (60)

moreover

|F_n| ≤ |C_n| = exp(nS).    (61)

We construct the approximating distribution on F_n. For a^n ∈ F_n, define

K(a^n) = ⌊ Q_{Y^n}(a^n) exp(nS + nγ) ⌋

(we assume from this point on that exp(nS + nγ) is an integer). Let a^n_1, a^n_2, ..., a^n_{|F_n|} be an ordering of the elements of F_n so that

P_{X^n}(a^n_i) ≤ P_{X^n}(a^n_{i−1}),  i ∈ {2, 3, ..., |F_n|}

let J be the largest integer L so that

Σ_{j=1}^{L} K(a^n_j) exp(−nS − nγ) ≤ 1

and set

K′ = exp(nS + nγ) − Σ_{j=1}^{J} K(a^n_j).

We now define the following distribution:

V_n(a^n_i) = K(a^n_i) exp(−nS − nγ),  1 ≤ i ≤ J
V_n(a^n_{J+1}) = K′ exp(−nS − nγ)
V_n(a^n_j) = 0,  j > J + 1.

V_n can be precisely synthesized by mapping a random variable uniformly distributed over exp(nS + nγ) elements. Now

D(V_n ‖ P_{X^n}) ≤ Σ_j K(a^n_j) exp(−nS − nγ) log [ K(a^n_j) exp(−nS − nγ) / P_{X^n}(a^n_j) ]    (62)

where the inequality holds since, by definition of K(a^n), the log is always positive.

Every term on the right-hand side of (62) can be upper-bounded by the corresponding term of D(Q_{Y^n} ‖ P_{X^n}) plus an exponentially small correction:

K(a^n_j) exp(−nS − nγ) log [ K(a^n_j) exp(−nS − nγ) / P_{X^n}(a^n_j) ] ≤ Q_{Y^n}(a^n_j) log [ Q_{Y^n}(a^n_j) / P_{X^n}(a^n_j) ] + (correction term).    (63)

Summing both sides of (63) over all elements of F_n and using (61), (62), we get

(1/n) D(V_n ‖ P_{X^n}) ≤ (1/n) D(Q_{Y^n} ‖ P_{X^n}) + (1/n) exp(−2nδ − nγ).    (64)

By definition of F_n and (60), (61), the contribution of the residual element a^n_{J+1} to (62) is also exponentially small for all sufficiently large n. Substituting in (64) and using (59), (60), we finally arrive at

(1/n) D(V_n ‖ P_{X^n}) ≤ (1/n) D(Q_{Y^n} ‖ P_{X^n}) + ε

for all ε > 0 and sufficiently large n, where the inequality is uniform for all Q_{Y^n} and all S. Due to the uniformity, we can particularize this inequality to any sequence of distributions {Q_{Y^n}}_{n≥1} satisfying

D̄(Q_Y ‖ P_X) ≤ D.

Using in addition the right-hand side of (56) we get

limsup_{n→∞} (1/n) D(V_n ‖ P_{X^n}) ≤ limsup_{n→∞} (1/n) D(Q_{Y^n} ‖ P_{X^n}) + ε

∀ Q_Y ∈ 𝒳(D), ε > 0.

Therefore, we have shown that for any γ > 0 and any process Q_Y ∈ 𝒳(D) we can construct an approximating process V whose resolution is at most R̄_0(Q_Y) + γ and such that D̄(V ‖ P_X) is arbitrarily close to D̄(Q_Y ‖ P_X). This completes the proof of Part a).
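The heart of the construction above is the quantization of a target pmf to masses that are multiples of exp(−nS − nγ). The following miniature (ours; it spreads the leftover atoms by largest remainders rather than dumping them into a single bin as the proof does) shows the approximation error decaying with the number of atoms.

import numpy as np

def synthesize(q, M):
    """Approximate pmf q by masses that are multiples of 1/M, as when
    mapping a uniform random variable over M points through a
    deterministic function. Leftover atoms go to the largest remainders,
    so the result sums to 1."""
    q = np.asarray(q, float)
    counts = np.floor(q * M).astype(int)        # K(a) = floor(q(a) * M)
    leftover = M - counts.sum()
    order = np.argsort(q * M - counts)[::-1]    # largest fractional parts first
    counts[order[:leftover]] += 1
    return counts / M

q = np.array([0.37, 0.29, 0.21, 0.13])
for M in (8, 64, 1024):
    v = synthesize(q, M)
    print(M, np.abs(v - q).sum())   # variational error shrinks like 1/M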

Part b) is proved along the lines of a). Part c) follows from (56) and (58). Part d) follows from a), b), and the fact that an optimal code encodes a maximal probability set. □

Computation of D̄(S) from its characterization is usually difficult. However, for those processes which satisfy the conditions of Theorem 8 (e.g., stationary ergodic Markov chains)

D_d(S, X) = D̄(S)

and the mean-resolvability functions are easier to compute. Similarly, if S_d(D, X) = S̄_d(D, X), then code-derandomized distributions achieve asymptotically the minimal divergence under an entropy constraint; i.e., for those processes, keeping only the exp(nS) highest probability masses (and normalizing) is asymptotically as good as raising the probabilities to a power α > 1 (and normalizing). Moreover, for any γ > 0, their exp(nS + nγ)-type approximations achieve resolvability with respect to divergence.

VI. RESOLVABILITY AND SOURCE CODING WITH A FIDELITY CRITERION

In Section III we have proved that finite-precision resolvability is equal to rate-distortion functions in very general settings. In this section we show another operational characterization of those rate-distortion functions: as general formulas for the minimal achievable source coding rate subject to a fidelity criterion. In the following definitions it is assumed that the source alphabet is A, the reproduction alphabet is B, and that we are given a sequence of distortion measures between A^n and B^n.

Definition 28: An (n, M, D) average-distortion code for P_{X^n} consists of an encoder map

f: A^n → {1, 2, ..., M}

and a decoder map

g: {1, 2, ..., M} → B^n

with average distortion less than or equal to D; i.e.

E_{P_{X^n}} [ (1/n) ρ_n(X^n, g[f(X^n)]) ] ≤ D.    (65)

Definition 29: R is an achievable coding rate at average distortion D for the source X if, for all sufficiently large n, there exists an (n, exp{nR}, D) average-distortion code for P_{X^n}. The infimum of the achievable coding rates at average distortion D for the source X is denoted by T(D, X).

Definitions 28 and 29 deal with coding of sources subject to an average distortion. The next definitions deal with achievable coding rates where the restriction is on the probability of excessive distortion. This may be a more useful criterion when dealing with nonstationary/nonergodic sources. The analogy here is with finite-precision resolvability with respect to the ρ_s distance.

Definition 30: An (n, M, D, ε) distortion code for P_{X^n} consists of an encoder map

f: A^n → {1, 2, ..., M}

and a decoder map

g: {1, 2, ..., M} → B^n

such that the probability of a distortion larger than D is less than ε; i.e.

P_{X^n}( a^n : (1/n) ρ_n(a^n, g[f(a^n)]) > D ) < ε.    (66)

Definition 31: R is an ε-achievable coding rate at distortion D for X if for all sufficiently large n there exists an (n, exp{nR}, D, ε) distortion code for P_{X^n}. R is an achievable coding rate at distortion D if it is ε-achievable for every ε > 0. The infimum of the achievable coding rates at distortion D for X is denoted by T_s(D, X).

Theorem 10: Let P_X and {ρ_n(·, ·)}_{n≥1} be given.

a) T_s(D, X) = S(D, X).

b) If ρ_n/n is bounded uniformly in n, then T(D, X) = S_ρ̄(D, X).

Proof: a) We first prove that T_s(D, X) ≤ S(D, X). To this end, it is enough to show that if R is a D-achievable resolution rate of X with respect to ρ_s, then it is also an ε-achievable coding rate at distortion D for X, for every ε > 0. Indeed, by the


definition of D-achievable resolution rate and by Lemma 10, for every γ > 0 there exists a joint process Q_XY such that

(1/n) R(Y^n) ≤ R + γ

and

ρ_s(Q_XY) ≤ D.

We can view Y^n as putting mass 1/M on each member of a collection of M = exp(nR + nγ) elements of B^n, denoted by C = {b^n_1, ..., b^n_M} (note that the elements of this set need not be distinct). Define a mapping φ: A^n → C that assigns, to each a^n ∈ A^n, an element b^n ∈ C that minimizes ρ_n(a^n, b^n). Then

Q_{X^nY^n}( a^nb^n : (1/n) ρ_n(a^n, b^n) > D ) ≥ Q_{X^nY^n}( a^nb^n : (1/n) ρ_n(a^n, φ(a^n)) > D ) = P_{X^n}( a^n : (1/n) ρ_n(a^n, φ(a^n)) > D )

implying that φ constitutes the desired encoder-decoder map.

We turn to show that T_s(D, X) ≥ S(D, X). We show that if R is an ε-achievable coding rate at distortion D for every ε > 0, then it is also a D-achievable resolution rate. By Definitions 30 and 31, for every ε > 0 there exists n(ε) such that for every n > n(ε) there exists a pair of mappings f_n^ε, g_n^ε such that

P_{X^n}( a^n : (1/n) ρ_n(a^n, g_n^ε[f_n^ε(a^n)]) > D ) < ε.

Let {ε_i} be a sequence converging to 0. For every n(ε_i) < n ≤ n(ε_{i+1}), let Q_{Y^n|X^n}(· | a^n) be the distribution that puts mass 1 on the element g_n^{ε_i}[f_n^{ε_i}(a^n)], and let Q_{Y^n} be the Y^n-marginal of Q_{X^nY^n} = Q_{Y^n|X^n} P_{X^n}. By construction the resulting joint process Q_XY satisfies

H̄(Y) ≤ R

and

ρ_s(Q_Y, P_X) ≤ D.

In view of Lemma 3, this completes the proof of part a).

b) The proof follows the lines of the proof of part a) and is omitted. □

The results of this section and of Section IV enable us to examine the minimal achievable block coding rate of a source where the fidelity criterion is the probability of error. This is the subject of the next corollary, a direct consequence of Theorem 10 and (27).

Corollary 3: Let T_e(D, X) stand for the minimal achievable block coding rate of X with probability of decoding error D. Then

T_e(D, X) = S(D, X) = S_v(2D, X) = R̄_v(2D),  0 < D < 1

with

ρ_n(a^n, b^n) = n d_H(a^n, b^n).

In Examples 1 and 3 we saw that for Bernoulli processes, the resolvability functions with respect to the Ornstein, ρ_s, and Prohorov distances coincide. In fact, using the connection between resolvability and rate-distortion theory and strong converses for source coding due to Kieffer [6], it can be shown that for stationary ergodic processes S, S_ρ̄, and S_π are equal, when defined with the same metric. This is the subject of the next theorem.

Theorem 11: Let X be stationary and ergodic, and let the sequence of distortion measures satisfy

ρ_{n+m}(x_1x_2, y_1y_2) ≤ ρ_n(x_1, y_1) + ρ_m(x_2, y_2)  ∀ x_1, y_1 ∈ A^n, x_2, y_2 ∈ A^m.    (67)

Then

S(D, X) = S_ρ̄(D, X) = S_π(D, X).

The proof of Theorem 11 is based on the following theorem, which is part of the strong converse results of Kieffer.

Theorem 12 [6]: Let X be stationary ergodic, let the distortion measures ρ_n satisfy (67), and let (f_n, g_n) be a sequence of codes for X. Denote by |f_n| the codebook size. Then

lim_{n→∞} P_{X^n}( (1/n) ρ_n(X^n, g_n[f_n(X^n)]) > D ) = 1

whenever

limsup_{n→∞} (1/n) log |f_n| < T(D, X).

Proof of Theorem 11: We will make use of the continuity of T(D, X), implied by (67). Assume that, for some γ > 0,

S_ρ̄(D, X) = S(D, X) − 2γ.

This implies that, for this source,

T(D, X) = T_s(D, X) − 2γ.

Let (f_n, g_n) be a sequence of codes that achieves T(D, X). By definition of T_s(D, X) and by continuity of T(D, X), there exist α_n > α > 0, δ > 0, and a subsequence J for which

P_{X^n}[ (1/n) ρ_n(X^n, g_n(f_n(X^n))) > D + δ ] = α_n,  n ∈ J.

This implies that a sequence of codes whose rate is smaller than T(D + δ, X) has a probability of excess distortion bounded away from 1 along J, contradicting the strong converse. Therefore, S_ρ̄(D, X) ≥ S(D, X), and by Lemma 1 we obtain

S_ρ̄(D, X) = S(D, X).

Now, in view of this result and part b1) of Lemma 1, to complete the proof of the theorem it is enough to show that

S_π(D, X) ≥ S_ρ̄(D, X).

Assume otherwise, that there exist γ > 0, δ > 0, such that

S_π(D, X) = S_ρ̄(D + δ, X) − γ.


By definition of S_π(D, X), this implies that there exists a process Y and a sequence of joint distributions {P_{X^nY^n}}_{n≥1} such that

liminf_{n→∞} P_{X^nY^n}[ (1/n) ρ_n(X^n, Y^n) ≤ D ] > 1 − D    (68)

and such that the alphabet of Y^n, denoted by C_n^Y, satisfies

limsup_{n→∞} (1/n) log |C_n^Y| ≤ S_ρ̄(D + δ, X) − γ.    (69)

From (68) we have

limsup_{n→∞} P_{X^nY^n}[ (1/n) ρ_n(X^n, Y^n) > D ] < D < 1

and in view of (69), this contradicts the strong converse. □

APPENDIX I
PROPERTIES OF THE ρ_s DISTORTION MEASURE

In this appendix we state and prove some properties of the ρ_s distortion measure. We first show that the infimum over P(P_X, P_Y) is actually a minimum.

Lemma 10: For every P_X, P_Y, there exists Q_XY ∈ P(P_X, P_Y) such that

ρ_s(P_X, P_Y) = ρ_s(Q_XY).

Proof: We only have to prove that ρ_s(P_X, P_Y) ≥ ρ_s(Q_XY) for some Q_XY in P(P_X, P_Y). We show it by construction. By definition of ρ_s, there exists a sequence {ε_i}_{i≥1}, ε_i → 0, and a corresponding sequence {Q^{(i)}_XY}, such that

ρ_s(P_X, P_Y) ≥ ρ_s(Q^{(i)}_XY) − ε_i.    (70)

By definition of ρ_s(Q^{(i)}_XY), there exists a sequence n(ε_i), i ≥ 1, such that for every n > n(ε_i)

Q^{(i)}_{X^nY^n}( a^nb^n : (1/n) ρ_n(a^n, b^n) ≥ ρ_s(Q^{(i)}_XY) + ε_i ) ≤ ε_i

and from (70) it follows

Q^{(i)}_{X^nY^n}( a^nb^n : (1/n) ρ_n(a^n, b^n) ≥ ρ_s(P_X, P_Y) + 2ε_i ) ≤ ε_i  ∀ n > n(ε_i).    (71)

Construct Q_XY as

Q_{X^nY^n} = Q^{(i)}_{X^nY^n},  for n(ε_i) < n ≤ n(ε_{i+1}).

By (71) we conclude that for every γ > 0

lim_{n→∞} Q_{X^nY^n}( a^nb^n : (1/n) ρ_n(a^n, b^n) ≥ ρ_s(P_X, P_Y) + γ ) = 0

implying

ρ_s(Q_XY) ≤ ρ_s(P_X, P_Y). □

We next show that when the distortion measures on sequence space are pseudometrics, the resulting ρ_s is a pseudometric.

Lemma 11: Whenever {ρ_n}_{n≥1} is a sequence of pseudometrics, ρ_s is a pseudometric on the space of all sequences of finite-dimensional distributions {P_{X^n}}_{n≥1}.

Proof: It is straightforward to verify that ρ_s(P_X, P_X) = 0 and that ρ_s(·, ·) is symmetric. It remains to show that the triangle inequality holds. Let Q_XYZ be a joint source with marginals P_X, P_Y, P_Z, all of which with alphabet A. We have

ρ_s(P_X, P_Z)
≤ inf{ h : lim_{n→∞} Q_{X^nZ^n}( a^nc^n ∈ A^n × A^n : (1/n) ρ_n(a^n, c^n) > h ) = 0 }
≤ inf{ h : lim_{n→∞} Q_{X^nY^nZ^n}( a^nb^nc^n : (1/n) ρ_n(a^n, b^n) + (1/n) ρ_n(b^n, c^n) > h ) = 0 }
≤ inf{ h_1 + h_2 : lim_{n→∞} Q_{X^nY^nZ^n}( { a^nb^nc^n : (1/n) ρ_n(a^n, b^n) > h_1 } ∪ { a^nb^nc^n : (1/n) ρ_n(b^n, c^n) > h_2 } ) = 0 }
≤ ρ_s(P_X, P_Y) + ρ_s(P_Y, P_Z)

where the second inequality uses the triangle inequality for ρ_n and the third uses the union bound. The lemma follows by considering the joint source Q_XYZ whose XY-marginal achieves ρ_s(P_X, P_Y) and whose YZ-marginal achieves ρ_s(P_Y, P_Z). □

The connection between ρ̄ and ρ_s is stated in the next lemma. We show that if P_X, P_Y are stationary and ergodic, and ρ_n(·, ·)/n is bounded, then ρ_s equals ρ̄.

Lemma 12: Let {ρ_n(·, ·)}_{n≥1} be a sequence of additive distortion measures

ρ_n(a^n, b^n) = Σ_{i=1}^{n} ρ_1(a_i, b_i)  ∀ n, a^n ∈ A^n, b^n ∈ B^n

where ρ_1(·, ·) is bounded. Then

ρ_s(P_X, P_Y) = ρ̄(P_X, P_Y)

whenever X, Y are stationary and ergodic.


Proof: This is an immediate consequence of [12, Theorem 10.4.1]. Indeed, let P_se(P_X, P_Y) be the subset of all stationary and ergodic measures in P(P_X, P_Y). By [12, Theorem 10.4.1]

ρ̄(P_X, P_Y) = inf_{Q ∈ P_se(P_X, P_Y)} E_Q ρ_1(X_1, Y_1).

This implies that

ρ_s(P_X, P_Y) ≤ ρ̄(P_X, P_Y).

The other direction follows by the boundedness of ρ_1. □

APPENDIX II
PROOF OF LEMMA 9

We may as well assume that (1/n) D(P_{Z^n} ‖ P_{X^n}) is bounded for all sufficiently large n, since otherwise there is nothing to prove. The function x log (x/t) is monotone decreasing in t. Therefore, if the support of P_{Z^n} is not a maximal probability set according to P_{X^n}, we can construct a distribution Q^n such that

H(Q^n) = H(P_{Z^n})
D(Q^n ‖ P_{X^n}) ≤ D(P_{Z^n} ‖ P_{X^n}).

Henceforth, we can concentrate on processes P_Z for which, for every n, the support of P_{Z^n} is a maximal probability set according to P_{X^n}.

Choose δ > 0 and define the sets

C_X = { a^n : P_{X^n}(a^n) ≥ exp(−nH(X) − nδ) }
C_Z = { a^n : P_{Z^n}(a^n) ≥ exp(−nH(Z) − nδ) }.    (72)

Since H(Z) < H(X) and both processes are information-stable, |C_Z| is exponentially smaller than |C_X|. In particular, C_Z is a maximal probability set according to P_{X^n},

C_Z ⊆ C_X    (73)

and

lim_{n→∞} P_{X^n}(C_Z) = 0    (74)

lim_{n→∞} P_{Z^n}(C_Z) = 1.    (75)

Now, using the log-sum inequality and (74), (75), the contribution of the complement of C_Z to the divergence is bounded by a sequence

ℓ_n → 0 as n → ∞.    (76)

Define the distribution Q_{Y^n} as P_{Z^n} conditioned on the set C_Z, i.e.

Q_{Y^n}(a^n) = P_{Z^n}(a^n)/P_{Z^n}(C_Z) for a^n ∈ C_Z, and 0 otherwise.

Then, since (1/n) D(P_{Z^n} ‖ P_{X^n}) is bounded, due to (75), (76)

limsup_{n→∞} (1/n) D(Q_{Y^n} ‖ P_{X^n}) ≤ limsup_{n→∞} (1/n) D(P_{Z^n} ‖ P_{X^n}).    (77)

By way of construction of Q_{Y^n}, its support is C_Z and

Q_{Y^n}(a^n) ≥ exp(−nH(Z) − 2nδ)  ∀ a^n ∈ C_Z and sufficiently large n    (78)

and

|C_Z| ≤ exp(nH(Z) + nδ).    (79)

Choose γ > 3δ. Define

K_n = ⌊ exp(nH(Z) + nδ + nγ) ⌋.

Let J_n be a set of cardinality K_n, and let U be a uniform distribution on it. Let a^n_1, ..., a^n_{|C_Z|} be an arbitrary ordering of the elements of C_Z, with the only restriction that the last element has the smallest conditional mass,

Q_{Y^n}(a^n_{|C_Z|}) ≤ exp(−nH(Z) − nδ).

With this choice, the mass that will eventually be added to the last bin is negligible, and therefore this term does not contribute to (1/n) D(Q_{Y^n} ‖ P_{X^n}) in the limit n → ∞.

Denote by B(1), ..., B(|C_Z|) bins corresponding to the ordered elements of C_Z. We now aggregate mass from J_n into the bins to construct the approximating distribution V_n. Place elements from J_n into B(1) until its probability satisfies

Q_{Y^n}(a^n_1) ≥ U(B(1)) > Q_{Y^n}(a^n_1) − K_n^{−1}.

By (78) and the way γ was chosen

U(B(1)) > Q_{Y^n}(a^n_1) − exp(−nH(Z) − nδ − nγ) ≥ Q_{Y^n}(a^n_1) [1 − exp(−2nδ)].    (80)

Repeat this procedure for all the bins. (When filling the jth bin, Q_{Y^n}(a^n_j) replaces Q_{Y^n}(a^n_1) in the filling rule.) Clearly, we have enough elements in J_n for all the bins, and a few may


be left in J_n. The total mass left in J_n is upper bounded by

|C_Z| K_n^{−1} ≤ exp(−nγ).

We add this mass to the last bin. For convenience, set

ℓ'_n = U(B(|C_Z|)) − Q_{Y^n}(a^n_{|C_Z|}).

As in (80), we conclude that

lim_{n→∞} ℓ'_n = 0.    (81)

Define the following sets: the bins B(1), ..., B(|C_Z| − 1), which are filled according to the rule above, and the last bin B(|C_Z|), which in addition receives the leftover mass. Identify the approximating distribution V_n with the probabilities of the bins according to U, i.e., V_n(a^n_j) = U(B(j)). Thus

(1/n) D(V_n ‖ P_{X^n}) = (1/n) Σ_{j ≤ |C_Z| − 1} U(B(j)) log [ U(B(j)) / P_{X^n}(a^n_j) ] + (1/n) U(B(|C_Z|)) log [ U(B(|C_Z|)) / P_{X^n}(a^n_{|C_Z|}) ].    (82)

We now turn to bound each of the terms. Obviously, by (80), the first sum exceeds the corresponding portion of (1/n) D(Q_{Y^n} ‖ P_{X^n}) by at most exp(−2nδ) H(Z).    (83)

As for the second term, by (81) and the choice of the ordering, it is bounded by a sequence

η_n → 0 as n → ∞.    (84)

Substituting (81), (83), and (84) in (82), and using (80), we get

(1/n) D(V_n ‖ P_{X^n}) ≤ (1/n) D(Q_{Y^n} ‖ P_{X^n}) + exp(−2nδ) H(Z) + η_n

for some sequence η_n → 0. In view of (77), this completes the proof. □

REFERENCES

[1] T. S. Han and S. Verdú, "Approximation theory of output statistics," IEEE Trans. Inform. Theory, vol. 39, pp. 752-772, May 1993.

[2] Y. Steinberg and S. Verdú, "Channel simulation and coding with side information," IEEE Trans. Inform. Theory, vol. 40, no. 3, pp. 634-646, May 1994.

[3] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic Press, 1981.

[4] R. M. Gray and L. D. Davisson, "Source coding theorems without the ergodic assumption," IEEE Trans. Inform. Theory, vol. IT-20, no. 4, pp. 502-516, July 1974.

[5] J. C. Kieffer, "On the optimum average distortion attainable by fixed-rate coding of a nonergodic source," IEEE Trans. Inform. Theory, vol. IT-21, no. 2, pp. 190-193, Mar. 1975.

[6] J. C. Kieffer, "Strong converses in source coding relative to a fidelity criterion," IEEE Trans. Inform. Theory, vol. 37, no. 2, pp. 257-262, Mar. 1991.

[7] P. C. Shields, D. L. Neuhoff, L. D. Davisson, and F. Ledrappier, "The distortion-rate function for nonergodic sources," Ann. Prob., vol. 6, no. 1, pp. 138-143, 1978.

[8] Y. V. Prohorov, "Convergence of random processes and limit theorems in probability theory," Theor. Probability Appl., vol. 1, pp. 157-214, 1956.

[9] V. Strassen, "The existence of probability measures with given marginals," Ann. Math. Statist., vol. 36, pp. 423-439, 1965.

[10] R. M. Gray, D. L. Neuhoff, and P. C. Shields, "A generalization of Ornstein's d̄ distance with applications to information theory," Ann. Prob., vol. 3, no. 2, pp. 315-328, 1975.

[11] I. Vajda, Theory of Statistical Inference and Information. Norwell, MA: Kluwer, 1989.

[12] R. M. Gray, Entropy and Information Theory. New York: Springer-Verlag, 1990.

[13] D. S. Ornstein, "An application of ergodic theory to probability theory," Ann. Prob., vol. 1, pp. 43-58, 1973.

[14] P. Papantoni-Kazakos and R. M. Gray, "Robustness of estimators on stationary observations," Ann. Prob., vol. 7, pp. 989-1002, 1979.

[15] F. R. Hampel, "A general qualitative definition of robustness," Ann. Math. Statist., vol. 42, pp. 1887-1896, 1971.

[16] R. M. Gray and D. S. Ornstein, "Block coding for discrete stationary d̄-continuous noisy channels," IEEE Trans. Inform. Theory, vol. IT-25, no. 3, pp. 292-306, May 1979.

[17] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.

[18] D. E. Knuth and A. C. Yao, "The complexity of random number generation," in Proc. Symp. on New Directions and Recent Results in Algorithms and Complexity. New York: Academic Press, 1976.