Principles of Embedded Networked Systems Design by Gregory J. Pottie and William J. Kaiser, Chapter 2



2. Representation of signals

Source detection, localization, and identification begin with the realization that sources are by their nature probabilistic. The source, the coupling of the source to the medium, the medium itself, and the noise processes are each variable to some degree. Indeed, it is this very randomness that presents the fundamental barrier to rapid and accurate estimation of signals. As such, applied probability is essential to the study of communications and other detection and estimation problems. This chapter is concerned with the description of signals as random processes, how signals are transformed from continuous (real-world) signals into discrete representations, and the fundamental limits on the loss of information that result from such transformations. Three broad topics will be touched upon: basic probability theory, representation of stochastic processes, and information theory.

2.1 Probability

Discrete Random Variables

A discrete random variable (RV) X takes on values in a finite set X = {x1, x2, ..., xm}. The probability of any instance x being xi, written P(x = xi), is pi. Any probability distribution must satisfy the following axioms:

1. P(xi) >= 0.
2. The probability of an event which is certain is 1.
3. If xi and xj are mutually exclusive events, then P(xi + xj) = P(xi) + P(xj).

That is, probabilities are always non-negative, the maximum probability is 1, and probabilities are additive if one event excludes the occurrence of another. Throughout the book, a random variable is denoted by a capital letter, while any given sample of the distribution is denoted by a lower-case letter.

Example 2.1 Binomial distribution
An experiment is conducted in which an outcome of 1 has probability p and an outcome of 0 has probability 1-p. In Bernoulli trials, each outcome is independent; that is, it does not depend on the results of the prior trials. What is the probability that exactly k 1s are observed in n trials?

Solution: Consider the number of ways there can be k 1s in a vector of length n, and the probability of each instance. Each 1 has probability p, and each 0 has probability (1-p), so that each vector with k 1s has probability p^k (1-p)^{n-k}. The number of possible combinations of k 1s is

\binom{n}{k} = \frac{n!}{k!\,(n-k)!},

thus

P(k) = \binom{n}{k} p^k (1-p)^{n-k}.

This is known as the binomial distribution. The mean (average) value is np and the standard deviation is \sqrt{np(1-p)}. In performing experiments, error bars of the size of the standard deviation are sometimes drawn. Thus, in Bernoulli trials where p is small, the uncertainty is about the square root of the number of observed 1s. For large n, the binomial distribution approximates the Gaussian or normal distribution in the body, but is generally a poor approximation in the tails (low-probability regions) due to insufficient density of samples. The trials that occur within one standard deviation of the mean represent roughly 2/3 of the possible outcomes.
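To make Example 2.1 concrete, the following short Python sketch (an illustration added to this transcript, not part of the original text; the values n = 100 and p = 0.1 are hypothetical) computes the binomial probabilities, the mean, the standard deviation, and the fraction of probability within one standard deviation of the mean.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """Probability of exactly k ones in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.1          # hypothetical example values
mean = n * p             # np
std = sqrt(n * p * (1 - p))

# Probability mass within one standard deviation of the mean
within = sum(binomial_pmf(k, n, p)
             for k in range(n + 1)
             if abs(k - mean) <= std)

print(f"mean = {mean}, std = {std:.3f}")
print(f"P(|k - mean| <= std) = {within:.3f}")   # roughly 2/3, as noted above
```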


    Note that if the trials are not independent, then little credibility can be attached to this uncertainty measure. Consider now two RVs X and Y. The joint probability distribution P(X,Y) is the probability that both x and y occur. In the event that outcomes x do not depend in any way on outcomes y (and vice versa) then X and Y are said to be statistically independent and thus P(X,Y)=P(X)P(Y). For example let x be the outcome of a fair coin toss and y the outcome of a second fair coin toss. The probability of a head or tail on either toss is 1/2, independently of what will happen in the future or what has happened in the past. Thus the probability of each of the four possible outcomes of (head, head), (head, tail), (tail, head), (tail, tail) is 1/4. The conditional distribution P(X|Y) is the probability distribution for X given that Y=y has occurred. When X and Y are statistically independent, this is simply P(X). More generally,

P(X|Y)P(Y) = P(X,Y) = P(Y|X)P(X).   (2.1)

Example 2.2 Campaign contributions and votes in Dystopia
In the corrupt political system of the town of Dystopia, campaign contributions Y influence votes on legislation X. Let the two outcomes of X be {yea, nay}, and let the sum of campaign contributions by persons favorable to the legislation, less the sum of contributions by persons unfavorable to the legislation, be drawn from the set {-100, 100}. These outcomes are equally likely, as are the voting outcomes, and P(yea|100)=1, P(nay|-100)=1. Determine P(X,Y) and P(Y|X).

Solution: Clearly P(yea|-100)=P(nay|100)=0. Thus P(yea,100)=P(yea|100)P(100)=1/2, and similarly P(nay,-100)=1/2, while P(yea,-100)=P(nay,100)=0. Checking, the sum of all the possibilities is 1. Working from this, P(Y|X)=P(X,Y)/P(X), i.e., P(100|yea)=0.5/0.5=1=P(-100|nay), and P(100|nay)=P(-100|yea)=0. Notice that while the causal relationship is that campaign contributions lead to a particular voting outcome, P(Y|X) still has meaning as an inference from the observations. It is often beneficial to use a form of this relationship known as Bayes' rule to rearrange problems in terms of probability distributions that are known or easily observed:

P(X|Y) = P(Y|X)P(X)/P(Y).

In particular, this relation is invoked in the derivation of the optimal detectors for signals in noise.

Exercise: In the neighboring town of Utopia, campaign contributions have no influence on legislative outcomes. Write P(X,Y), P(Y|X), and P(X|Y).
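The joint, marginal, and conditional distributions of Example 2.2 can be tabulated directly; the Python sketch below (an added illustration, not from the original text) walks through the same arithmetic, including Bayes' rule.

```python
# Joint distribution P(X, Y) for Example 2.2 (Dystopia):
# votes X in {yea, nay}, net contributions Y in {-100, +100}.
joint = {('yea', 100): 0.5, ('nay', -100): 0.5,
         ('yea', -100): 0.0, ('nay', 100): 0.0}

# Marginals P(X) and P(Y)
P_X, P_Y = {}, {}
for (x, y), p in joint.items():
    P_X[x] = P_X.get(x, 0.0) + p
    P_Y[y] = P_Y.get(y, 0.0) + p

# Conditionals P(Y|X) = P(X,Y)/P(X) and, via Bayes' rule, P(X|Y) = P(Y|X)P(X)/P(Y)
P_Y_given_X = {(y, x): joint[(x, y)] / P_X[x] for (x, y) in joint}
P_X_given_Y = {(x, y): P_Y_given_X[(y, x)] * P_X[x] / P_Y[y] for (x, y) in joint}

print(P_Y_given_X[(100, 'yea')])   # 1.0, matching P(100|yea) in the example
print(P_X_given_Y[('yea', 100)])   # 1.0
```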


Continuous random variables

Discrete distributions arise whenever there is a finite alphabet of outcomes, as for example in communicating binary strings or addressing a digital visual display. Because digital signal processors always represent the world as a discretized and quantized approximation to continuous processes, computers exclusively deal with finite sets. However, many problems are much more naturally thought about in the context of random variables X spanning portions of the real numbers, with an associated probability density function (pdf) fX(x), which we will usually abbreviate as f(x). Note that the value of a density function is not a probability; the pdf must be integrated over some finite interval (a,b) to obtain the probability that x lies in the interval. The expected value or mean of a random variable X is denoted by μ = E[X], given by

E[X] = \int_{-\infty}^{\infty} x f(x)\,dx.   (2.2)

More generally, the expectation operator involves an integration of the product of the argument of the expectation and the corresponding pdf. The nth moment is given by

E[X^n] = \int_{-\infty}^{\infty} x^n f(x)\,dx.   (2.3)

Of particular importance are the mean and variance. The variance is given by

\mathrm{var}(X) = \sigma^2 = E[(X-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx = E[X^2] - \mu^2.   (2.4)

The square root of the variance is known as the standard deviation (the parameter σ of a Gaussian distribution). Two distributions of immense importance are the uniform and Gaussian (normal) distributions. In the uniform distribution, a random variable X takes on values with equal probability in the interval (a,b). Thus f(x) = 1/(b-a); notice that the integral of f(x) over (a,b) will be equal to 1. Many physical phenomena are approximately described by a Gaussian distribution. The ubiquity of the Gaussian distribution is a consequence of the Central Limit Theorem: for n independent random variables Xi, the RV

X = \frac{1}{n} \sum_{i=1}^{n} X_i

tends to the Gaussian distribution for n sufficiently large. That is,

f(x) \approx \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2 / 2\sigma^2}   (2.5)

where μ is the mean of the distribution and σ the standard deviation. Associated with the Gaussian distribution are two functions which relate to the area under its tails. The Gaussian Q function is defined as

Q(x) = \frac{1}{\sqrt{2\pi}} \int_{x}^{\infty} e^{-z^2/2}\,dz.   (2.6)

This integral is the area under the tail of a zero-mean Gaussian with unit standard deviation, as sketched in Figure 2.1, and may only be solved numerically.



Figure 2.1: The Gaussian Q function

A good approximation for x larger than 4 is

Q(x) \approx \frac{1}{x\sqrt{2\pi}} e^{-x^2/2}.   (2.7)

The Q function is tabulated in Appendix A.

Example 2.3 Gaussian tails
The tail of a distribution is the region of low probability. For a Gaussian, the tails are bounded by exponentials. Using the approximation to the Q function, compute the probability that y is between 2 and 2.1, where f(y) is a N(0,0.25) distribution, i.e., a normal distribution with zero mean and a variance of 0.25 (standard deviation 0.5).

Solution: Since the Q function is defined for a normal distribution with unit standard deviation, one must perform a change of variables, scaling by the ratio of the standard deviations (and adjusting for a non-zero mean as might be required in some cases). Let x = y/σ = 2y. Then P(2 < y < 2.1) = Q(4) - Q(4.2), which by the approximation above is roughly 3.3x10^-5 - 1.4x10^-5, i.e., about 2x10^-5.
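A quick numerical check of Example 2.3 (added here as an illustration; the exact Q function is expressed through the complementary error function as Q(x) = erfc(x/sqrt(2))/2):

```python
from math import erfc, exp, pi, sqrt

def Q(x):
    """Exact Gaussian tail probability via the complementary error function."""
    return 0.5 * erfc(x / sqrt(2.0))

def Q_approx(x):
    """Approximation Q(x) ~ exp(-x^2/2) / (x*sqrt(2*pi)), good for x > 4."""
    return exp(-x * x / 2.0) / (x * sqrt(2.0 * pi))

# Example 2.3: y ~ N(0, 0.25), so sigma = 0.5 and x = y/sigma
p_exact = Q(2 / 0.5) - Q(2.1 / 0.5)      # Q(4) - Q(4.2)
p_approx = Q_approx(4.0) - Q_approx(4.2)

print(f"exact  P(2 < y < 2.1) = {p_exact:.3e}")
print(f"approx P(2 < y < 2.1) = {p_approx:.3e}")
```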


The modified Bessel function of the first kind of order zero, I0(x), satisfies I0(x) ≈ 1 for x → 0 and I0(x) ≈ e^x / \sqrt{2\pi x} for x → ∞, and has the series expansion

I_0(x) = \sum_{m=0}^{\infty} \frac{(x/2)^{2m}}{m!\,m!}.   (2.10)

The lognormal distribution results from the application of the central limit theorem to a multiplicative process, that is, to a product of independent random variables. Let f(x) be N(0, σ²) where all quantities are expressed in decibels (dB). Then the lognormal distribution is what results after converting back to absolute units, i.e., Y = 10^{X/10}.

Example 2.4 The Poisson distribution
Suppose the probability of an event happening in a short time interval Δt is λΔt, with the probability of two or more events occurring in this short interval being very low. This event might be the beginning of a telephone call or the reception of a photon in a detector. Such events are usually termed arrivals. The probability of no arrivals in this interval is then 1 - λΔt. Suppose further that the probability of an arrival in any given interval is independent of the probability of an arrival in any prior interval. Determine the probability of k arrivals in a longer interval T.

Solution: This is of course related to the binomial distribution. Suppose T = nΔt, and let p = λΔt. Then the probability of k arrivals is simply

\frac{n!}{k!\,(n-k)!} p^k (1-p)^{n-k} \approx \frac{e^{-np}(np)^k}{k!} = \frac{e^{-\lambda T}(\lambda T)^k}{k!}

where the approximation is valid when p is small and n is large.
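The quality of this approximation is easy to check numerically; the sketch below (an added illustration with hypothetical values λT = 3 and n = 1000) compares the binomial and Poisson probabilities for the first few values of k.

```python
from math import comb, exp, factorial

lam_T = 3.0      # expected number of arrivals in the interval T (hypothetical)
n = 1000         # number of short sub-intervals
p = lam_T / n    # arrival probability per sub-interval

for k in range(6):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = exp(-lam_T) * lam_T**k / factorial(k)
    print(f"k={k}: binomial={binom:.5f}  Poisson={poisson:.5f}")
```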


Computations involving signals and linear systems are often more convenient in the frequency domain. The Fourier transform pair relating a signal x(t) to its spectrum X(f) is

X(f) = \int_{-\infty}^{\infty} x(t)\,e^{-j2\pi f t}\,dt, \qquad x(t) = \int_{-\infty}^{\infty} X(f)\,e^{j2\pi f t}\,df.

In solving these problems, resort is typically made to a number of tabulated transform pairs, as well as the following properties:

1. Linearity: if x1(t) ↔ X1(f) and x2(t) ↔ X2(f), then c1 x1(t) + c2 x2(t) ↔ c1 X1(f) + c2 X2(f).

2. Time scaling: x(at) ↔ (1/|a|) X(f/a).

3. Duality: X(t) ↔ x(-f).

4. Time shifting: x(t - t0) ↔ X(f) e^{-j2\pi f t_0}.

5. Frequency shifting: e^{j2\pi f_c t} x(t) ↔ X(f - f_c).

6. Differentiation in the time domain: d^n x(t)/dt^n ↔ (j2\pi f)^n X(f).

7. Integration in the time domain: \int_{-\infty}^{t} x(\tau)\,d\tau ↔ X(f)/(j2\pi f).

8. Conjugate functions: x*(t) ↔ X*(-f).

9. Multiplication in the time domain (convolution in the frequency domain): x1(t) x2(t) ↔ \int_{-\infty}^{\infty} X1(\lambda) X2(f-\lambda)\,d\lambda.

10. Convolution in the time domain (multiplication in the frequency domain): \int_{-\infty}^{\infty} x1(\tau) x2(t-\tau)\,d\tau ↔ X1(f) X2(f).

Example 2.5 Linear system
A signal x(t) passes through a linear filter with impulse response h(t) to produce the output y(t). Then y(t) is given by the convolution integral \int_{-\infty}^{\infty} x(\tau)\,h(t-\tau)\,d\tau. Suppose h(t) represents the lowpass filter A sinc(2Wt) and x(t) is the sinusoid cos(2\pi f_c t), where the carrier frequency f_c is less than W. Compute y(t).

Solution: This integral is rather difficult to evaluate directly. Instead, one may use the Fourier transform pairs

A\,\mathrm{sinc}(2Wt) ↔ \frac{A}{2W}\,\mathrm{rect}\!\left(\frac{f}{2W}\right) \quad\text{and}\quad \cos(2\pi f_c t) ↔ \tfrac{1}{2}\left(\delta(f-f_c) + \delta(f+f_c)\right)

together with property (10) above to obtain

Y(f) = \frac{A}{4W}\left(\delta(f-f_c) + \delta(f+f_c)\right).

Taking the inverse transform,

y(t) = \frac{A}{2W}\cos(2\pi f_c t).

Avoidance of difficult integrals in dealing with linear systems is a major reason for use of the frequency domain. Another is that many of these problems are more easily visualized in the frequency domain.
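Example 2.5 can also be verified numerically by applying the ideal lowpass response in the frequency domain. The sketch below (an added illustration using numpy; the values A = 1, W = 100 Hz, fc = 30 Hz are hypothetical) recovers the predicted output amplitude A/(2W).

```python
import numpy as np

A, W, fc = 1.0, 100.0, 30.0      # hypothetical filter gain, bandwidth, carrier
fs, T = 2000.0, 2.0              # sample rate (Hz) and record length (s)
t = np.arange(0, T, 1 / fs)

x = np.cos(2 * np.pi * fc * t)   # input sinusoid

# H(f) = (A / 2W) * rect(f / 2W): an ideal lowpass with cutoff W
f = np.fft.fftfreq(t.size, d=1 / fs)
H = np.where(np.abs(f) < W, A / (2 * W), 0.0)

y = np.real(np.fft.ifft(H * np.fft.fft(x)))

print("predicted amplitude:", A / (2 * W))        # 0.005
print("measured amplitude :", np.max(np.abs(y)))  # close to 0.005
```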


A question sometimes asked is what is the meaning of the frequency domain. It is perhaps best visualized from the point of view of Fourier series, whereby functions are composed of a weighted sum of harmonic components. The time and frequency domains are merely abstract models of the behavior of systems, and since both can potentially be complete representations they have equal value. The frequency domain is convenient in dealing with some aspects of linear systems (e.g., filters) while the time domain is useful for other aspects (e.g., description of pulses). The problem solver needs facility in multiple abstract models, and indeed many orthogonal transformations have been developed for various signal processing purposes. As an example, the energy of a pulse can be computed in either time or frequency using the Parseval relation:

E = \int_{-\infty}^{\infty} |x(t)|^2\,dt = \int_{-\infty}^{\infty} |X(f)|^2\,df.   (2.13)

Chances are that the integration is much less difficult in one domain compared to the other.

Representation of Stochastic Processes

Thus equipped, the representation of random processes can now be addressed. Let X be a continuous random variable with probability density function f(x). A random or stochastic process consists of a sequence of random variables X1, X2, X3, ... and their associated probability density functions. These random variables may for example be the samples in time of a continuous process (e.g., speech), or simply the sample variables of a discrete-time process (e.g., a bit stream). In either case, the random process may be denoted by X(t). Random processes are typically classified by how their statistical properties change with time and by the correlation of sample values over time, as well as by which probability distribution(s) the samples are drawn from. Consider for example a Gaussian random process, in which f(x) is N(0, σ²) and does not change with time. This is a stationary random process. However, this is not a complete description of the random process, since it neglects any measure of how adjacent samples are correlated in their values. In principle a discrete process could be fully characterized by the complete set of joint distributions, but this is unwieldy and in any case could never be measured to any reasonable accuracy. Instead this correlation in time is usually captured in the autocorrelation function, which for a complex random process X(t) of finite power is given by

\phi_x(\tau) = \lim_{T\to\infty} \frac{1}{T} \int_{-T/2}^{T/2} x^*(t)\,x(t+\tau)\,dt   (2.14)

where τ is the delay between the two versions of the signal. Properties of the autocorrelation function include:

1. \phi_x(\tau) = \phi_x^*(-\tau) (conjugate symmetry)
2. |\phi_x(\tau)| \le \phi_x(0) (maximum at zero delay)
3. \phi_x(\tau) ↔ S_X(f) (Fourier transform pair with the power spectral density)
4. \phi_x(0) = \int_{-\infty}^{\infty} S_X(f)\,df (average power)
5. If x(t) is periodic with x(t+T) = x(t) for all t, then \phi_x(\tau+T) = \phi_x(\tau) for all τ (the autocorrelation function is also periodic).


The first two properties give an indication of the general shape of an autocorrelation function. The magnitude is symmetric about a peak, and the function generally decays as one moves away from the peak. This corresponds to the fact that the samples of the random process are generally less correlated with one another the further apart they are in time. Property 3 actually defines the power spectral density (psd). The integral of S_X(f) over some interval (f1, f2) is the power of the process in that interval. As for linear systems and deterministic signals, the Fourier relationships between autocorrelation functions and the psd enable convenient calculation of the outputs of filters when the input is a random process. If X(t) is the input process, Y(t) the output process, and h(t) the impulse response of the time-invariant linear filter, then

S_Y(f) = |H(f)|^2 S_X(f).

Example 2.6 Colored noise
White Gaussian noise has psd S_X(f) = N_0/2, extending over all frequencies. What is its autocorrelation function?

Solution: Simply take the inverse Fourier transform, so that \phi_x(\tau) = (N_0/2)\,\delta(\tau).

Now suppose the white noise process passes through a linear filter with response h(t) = 1 over (0,T) and 0 otherwise. What is the autocorrelation function of the output process y(t)?

Solution: S_Y(f) = |H(f)|^2 N_0/2. The Fourier transform of a rectangular pulse is a sinc function, and the inverse transform of the sinc² function is a triangle function. Hence the autocorrelation function of the output is N_0/2 times the autocorrelation of h(t): a triangle with peak value T N_0/2 at τ = 0, falling to zero at τ = ±T, as plotted in Figure 2.2.

Figure 2.2 Autocorrelation function of band-limited noise
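A discrete-time analogue of Example 2.6 (an added illustration using numpy; unit-variance white noise stands in for the N0/2 level, and a boxcar of L samples stands in for h(t)) shows the triangular autocorrelation emerging from the sample estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 200_000, 20                      # number of noise samples, boxcar length

w = rng.standard_normal(N)              # white Gaussian noise, unit variance
y = np.convolve(w, np.ones(L), mode="valid")   # filter with h[n] = 1, n = 0..L-1

# Sample autocorrelation estimate of the output for lags 0..L
lags = np.arange(L + 1)
phi = np.array([np.mean(y[:y.size - k] * y[k:]) for k in lags])

# Theory: a triangle, phi[k] = (L - k) for 0 <= k <= L, and 0 beyond
for k in (0, L // 2, L):
    print(f"lag {k:2d}: estimate {phi[k]:7.2f}   theory {L - k:3d}")
```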

The above discussion assumed the random process to be time invariant. More generally, for a real random process the autocorrelation function is given by

\phi_x(t_1, t_2) = E[X(t_1)\,X(t_2)].   (2.15)

For a random process to be termed strict-sense stationary (SSS), it is required that none of its statistics are affected by a time shift of the origin. This is impractical to measure, and usually one settles for showing that a process is wide-sense stationary (WSS). Two requirements must be met:

1. E[X(t)] = \mu_X = constant
2. \phi_x(t_1, t_2) = \phi_x(t_1 - t_2) = \phi_x(\tau)

WSS Gaussian random processes are also SSS, but this is not generally true of other random processes. Note that the second property also implies the second moment is stationary, since the autocorrelation function at τ = 0 is the power of the process. In practice, many signals have slowly drifting statistics, and so the assumption is often made that they are quasi-static for purposes of calculations meant for some short time interval.


Gaussian random processes have other special properties. The convolution of a Gaussian process with a fixed linear filter produces another Gaussian process. If the input process is zero mean, so is the output process. However, convolution does change the autocorrelation function of the output process. To take a simple example, suppose the input process is white Gaussian noise. That is, the time samples follow a Gaussian distribution, and the power spectral density is flat (equal) over all frequencies. This results in the autocorrelation function being a delta function. In consequence, the output random process will continue to be Gaussian, but the output process psd will be colored by the frequency response of the filter, as discussed in Example 2.6.

Sampling theorem

While the physical world is reasonably modeled as being a (rather complicated) set of continuous real- or complex-valued random processes, digital computers in the end can only deal with binary strings. Thus understanding is required of the process of transforming analog signals into digital form and back if computations are to be coupled to the physical world. There are two aspects to these transformations: discretization in time by sampling, and quantizing the samples to a finite number of levels. The Shannon sampling theorem states that any random process strictly limited to a range W Hz in frequency can be completely represented by 2W samples per second. That is, it is possible to reconstruct the original process with complete fidelity from this set of samples. Viewed another way, this means that a random process strictly bandlimited to W Hz represents a resource of 2W independent dimensions per second. This viewpoint makes the connection of the sampling theorem to earlier work by Nyquist, who proved that the maximum rate at which impulses can be communicated over a channel bandlimited to W Hz without interference among them in the receiver is 2W pulses per second (known subsequently as the Nyquist rate). Thus in sampling, at least 2W samples per second are needed to be able to reproduce the process, while in communicating at most 2W independent dimensions per second are available. Achieving reproduction or communication at these limits demands extremely long filters, and so in practice signals are sampled more quickly or communicated more slowly. The sampling theorem is illustrated in Figure 2.3, where the left column shows the picture in the time domain and the right column the picture in the frequency domain. The sampling function s(t) is a train of equally spaced delta functions with spacing T, known as a Dirac comb. The Fourier transform of a Dirac comb is another Dirac comb S(f) with spacing 1/T in the frequency domain. Time multiplication corresponds to convolution in the frequency domain, resulting in multiple copies of the spectral image of the bandlimited process. If T is small enough, the spectral copies will not overlap and the original process can be reproduced using a low-pass filter. Otherwise, there is no means for reproduction. Clearly 1/T > 2W is required.


(Figure: on the left, x(t) is multiplied by the sampling function s(t); on the right, X(f) is convolved with S(f); a low-pass filter applied to X(f)*S(f) recovers X(f).)

Figure 2.3 The sampling theorem

Actual filters are not strictly bandlimited, and neither usually is the process being sampled. Over-sampling results in reproduction filters of lower required order for a given level of reproduction fidelity, but of course results in more samples per second being required. Under-sampling results in aliasing: overlap of the spectral copies. Usual practice is to cut off the upper band edges using sharp anti-aliasing filters prior to sampling. While this eliminates the high-frequency content, the result is usually less damaging than permitting aliasing.

Example 2.7 Sampled square wave
A square wave g(t) symmetric about the origin with amplitude 1 and period T is to be sampled at rate 5/T. What must be the cutoff frequency of the anti-aliasing filter, and what waveform will be reproduced from the samples by a low-pass reconstruction filter?

Solution: Since the square wave is periodic, the Fourier series may be used. For an even function, only the cosine terms need be considered. It has zero mean and thus

g(t) = \sum_{n=1}^{\infty} a_n \cos(2\pi n t/T), \qquad a_n = \frac{2}{T} \int_{-T/2}^{T/2} g(t) \cos(2\pi n t/T)\,dt.

Integration yields

a_n = \frac{4}{\pi n} \sin\!\left(\frac{\pi n}{2}\right),

which is 0 for even n. Sampling at rate 5/T implies that frequency components above 2.5/T will be aliased. Since the waveform has no frequency content between 1/T and 3/T, any cutoff frequency in this region will do. The filtering will leave only the fundamental, so sampling and reconstruction result in

g_r(t) = \frac{4}{\pi} \cos(2\pi t/T).
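The Fourier coefficients used in Example 2.7 can be checked numerically; the sketch below (an added illustration, taking the square wave to swing between -1 and +1) integrates for the first few coefficients and confirms that only odd harmonics are present, with fundamental amplitude 4/pi.

```python
import numpy as np

T = 1.0
t = np.linspace(-T / 2, T / 2, 200_001)
g = np.where(np.abs(t) < T / 4, 1.0, -1.0)   # even square wave, values +/-1, period T

for n in range(1, 6):
    a_n = (2 / T) * np.trapz(g * np.cos(2 * np.pi * n * t / T), t)
    print(f"a_{n} = {a_n:+.4f}")

print("4/pi =", 4 / np.pi)   # expected value of a_1; even-n coefficients vanish
```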


Quantization

Quantization of the samples into discrete levels introduces distortion that cannot be reversed. The difference between the input (real) value and the quantized value is known as quantization noise. Using more finely spaced levels is of course one way to reduce the quantization noise. Over a very broad class of quantization schemes, each time the number of levels is doubled the quantization noise is reduced by 6 dB. That is, for each extra bit of quantization the mean squared error is reduced by a factor of 4. A quantizer acts by dividing up the space into bins, with every signal lying within a bin assigned the value of a reproducer level somewhere within the bin. In the case of a one-dimensional linear quantizer the bins are evenly spaced and the reproducer levels are in the centers of the bins. Associated with each bin is a bit label. Thus if there are 8 bins, each reproducer level requires a 3-bit label, as illustrated in Figure 2.4 for a uniform quantizer. In this example, the bins have a size of 1, and the reproducer values are placed in the bin centers (denoted by arrows), taking on values {-3.5, -2.5, ..., 3.5}. Thus if x = 1.75, the corresponding reproducer level is 1.5 and the bit label is 101. The error or quantization noise is 0.25. The analog-to-digital (A/D) conversion process thus consists of anti-aliasing filtering followed by sampling, and then quantization, with the input process represented as a sequence of binary labels, one per sample.


Figure 2.4 3-bit uniform quantizer

In the digital-to-analog conversion process, the voltage levels corresponding to these reproducer levels may be read out of memory (indexed by the labels), an impulse generated, and then the impulse sequence passed through a low-pass filter to approximately reproduce the analog signal. The complete A/D and D/A sequence is illustrated in Figure 2.5.

(Figure: X(t) → low-pass filter h(t) → sampler → quantizer → binary labels → reproducer → low-pass filter h(t) → Y(t).)

    Figure 2.5 A/D and D/A conversion steps
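A minimal sketch of the 3-bit uniform quantizer of Figure 2.4 (added here as an illustration) reproduces the numbers quoted above for x = 1.75.

```python
def quantize_3bit(x):
    """Uniform 3-bit quantizer with unit bins on (-4, 4): returns (label, level)."""
    index = int(x + 4.0)              # bin index 0..7 for x in (-4, 4)
    index = max(0, min(7, index))     # clip values outside the range
    level = index - 3.5               # reproducer value at the bin center
    label = format(index, "03b")      # 3-bit label, 000 for the lowest bin
    return label, level

label, level = quantize_3bit(1.75)
print(label, level, 1.75 - level)     # 101 1.5 0.25
```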


A variety of means exist to obtain performance superior to uniform linear quantizers. These are particularly important when the distortion measure of interest is something other than mean squared error, as is often the case with visual, auditory, or other perceptual signals.

Example 2.8 The μ-law quantizer
The amplitude levels of ordinary speech approximately follow a Gaussian distribution. In the auditory system it is the relative distortion that is perceptible rather than the absolute level; that is, the fractional distortion from the waveform is what is important. Consequently larger absolute errors can be made on the loud signals compared to the quiet signals. In telephone network local offices, signals from the local loop are strictly bandlimited to 3.5 kHz, sampled at 8 kHz, and then quantized into 256 levels (8 bits) to produce a 64 kb/s stream. This stream is then transmitted by various digital modulation techniques over the long-distance network. At the local office on the other side, the stream is converted back to an analog signal for transmission over the local loop. To achieve approximately equal fractional distortion at each amplitude, since the Gaussian distribution has exponential tails, one could take the logarithm and then apply a uniform quantizer. Instead, this is combined into one step with a set of piece-wise linear uniform quantizers with increased spacing for the larger amplitudes. Thus finely spaced levels are used for low-amplitude signals and widely spaced ones for large-amplitude signals. This allows easily constructed linear quantizers to be used on signals with a broad dynamic range while limiting the number of bits required. An unintended consequence is increased difficulty in the design of voice-band modems due to the amplitude-dependent quantization noise this introduces.

2.3 Introduction to information theory

The sampling theorem provides one example of a fundamental limit in digital signal processing (with the converse of the Nyquist rate providing a limit in digital communications). The quantization example raises the question of what might be the ultimate limits in terms of the least distortion that can be achieved with a given number of bits in representing a random process. Information theory provides a powerful set of tools for studying these questions of fundamental limits, producing guideposts as to what is possible, and sometimes on the means for approaching these limits. This section will introduce three basic domains of information theory and the associated fundamental limits: noiseless source coding, rate distortion coding, and channel coding. Remarkably, the topics themselves and the early theorems were all the work of one man, Claude Shannon. In the fifty years since the publication of "A Mathematical Theory of Communication," practical methods have been devised that approach a number of these limits, and information theory has expanded to broader domains.

Entropy and information

Shannon postulated that a useful information measure must meet a number of requirements. First, independent pieces of information should add. Second, information should be measured according to the extent that data is unexpected. Hearing that it will be sunny in Los Angeles has less information content than the statement in January that it will be sunny in Ithaca, New York. The most important insight, however, is that the information content of sequences is more interesting and fundamental than consideration of individual messages.
The particular measure he adopted is termed the entropy, since it


expresses the randomness of a source. Let X be a RV taking values in the set X = {x1, x2, ..., xn} with the probability of element i being P(xi). Then the entropy in bits is given by

H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)   (2.16)

where the logarithm is base 2. Since the probabilities are between 0 and 1, and the logarithm is negative over that range, the entropy is a positive quantity. In the particular case that X takes on binary values, the entropy achieves its maximum value of 1 bit when the outcomes are equally likely, and a minimum of 0 when either outcome is certain. Consider two successive instances of fair coin tosses. The probabilities for each of the four possible outcomes are each 1/4, and the entropy is 2 bits, as would be expected. Thus it is a plausible information measure. The noiseless source coding theorem is to the effect that the entropy represents the minimum number of bits required to represent a sequence of discrete-valued RVs with complete fidelity. Thus ASCII or English text, or gray levels in a fax transmission, all have a minimum representation, and that is the entropy of the sequence. Indeed, practical source coding techniques have been developed that get very close to the entropy limit. Shannon suggested a method that illuminates some very important facts about correlated sequences.

Asymptotic Equipartition Property

The weak law of large numbers states that for iid RVs X1, X2, ..., Xn,

\frac{1}{n} \sum_{i=1}^{n} X_i \to E[X] \quad \text{as } n \to \infty.   (2.17)

This is easily applied to prove an important idea in information theory, the asymptotic equipartition property (AEP), which states that if P(x1, x2, ..., xn) is the probability of observing the sequence x1, x2, ..., xn, then -(1/n) log P(X1, X2, ..., Xn) is close to H(X) for large n. According to the AEP, a small set of the sequences, the typical set, will account for almost all of the observations, the members of the typical set are nearly equally likely, and the number of typical sequences is approximately 2^{nH}. The remaining sequences, although large in number, are collectively only rarely observed.

Example 2.9 Typical binary sequences
Consider a binary source with probabilities of the two outcomes being (1/4, 3/4). Compute (a) H(X), (b) the number of sequences of length 100, (c) the size of the typical set, and (d) the number of bits required to uniquely identify each member of the typical set.

Solution: H(X) = -(1/4) log(1/4) - (3/4) log(3/4) = 0.811 bits. The number of sequences of length 100 is simply 2^100 = 1.27x10^30, while the number of typical sequences is 2^81 = 2.42x10^24, roughly six orders of magnitude fewer. 82 bits are needed to index the members of the typical set (rounding up).

A data compression method due to Shannon which exploits the typical set is as follows. Allocate one bit to labeling whether a sequence is a member of the typical set or not. Then enumerate the typical set members using their indices as the codewords. Since there are 2^{nH} typical set members, the index length will be at most the integer above nH,


plus 1 for the indication of typical set membership. For the atypical set members, at most n log|X| + 1 bits will be required; simply send them in raw form. The expected number of bits required to describe a message will however be only about nH + 1 bits, since the atypical set has vanishing probability as n gets large. Thus the number of bits per symbol required will approach the entropy H, the smallest possible according to the noiseless source coding theorem. Note that the coding scheme is fully reversible, there being a one-to-one correspondence between codewords and the original data. Thus effort is spent on the typical set, and it is a matter of little importance how the vastly more numerous atypical set members are dealt with. Given that the conclusion depends only upon the weak law of large numbers, this is a design principle with broad implications. In practice, the source statistics may be time-varying or unknown, or may be such that more highly structured solutions will converge to the entropy bound with shorter sequences. Consequently, there exist a wide variety of compression algorithms.
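The numbers in Example 2.9 follow directly from the entropy; the sketch below (an added illustration) computes H(X), the typical-set size, and the index length for n = 100.

```python
from math import ceil, log2

def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

p = (0.25, 0.75)          # binary source of Example 2.9
n = 100

H = entropy(p)
print(f"H(X)              = {H:.3f} bits")
print(f"all sequences     = 2^{n} = {2**n:.3e}")
print(f"typical sequences ~ 2^{n*H:.1f} = {2**(n*H):.3e}")
print(f"index length      = {ceil(n*H)} bits (plus 1 flag bit)")
```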

Mutual information

Just as there are joint and conditional probability distributions, so there are joint and conditional entropy functions. The joint entropy between two variables X and Y is

H(X,Y) = -\sum_{x \in X} \sum_{y \in Y} P(x,y) \log P(x,y)   (2.18)

while the conditional entropy is defined as

H(Y|X) = \sum_{x \in X} P(x)\,H(Y|X=x) = -\sum_{x \in X} \sum_{y \in Y} P(x,y) \log P(y|x).   (2.19)

The joint and conditional entropies are related by the chain rule:

H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).   (2.20)

Notice that H(X|Y) is not in general the same as H(Y|X). A question that naturally arises is the extent to which knowledge of X assists in knowing Y. This is measured by the mutual information

I(X;Y) = H(X) - H(X|Y), or equivalently, I(X;Y) = H(Y) - H(Y|X).

The mutual information can thus be thought of as either the reduction in uncertainty of X due to knowledge of Y, or the reduction in the uncertainty of Y due to knowledge of X. Note that I(X;X) = H(X) - H(X|X) = H(X), since H(X|X) must be zero. Also note that I(X;Y) = H(X) + H(Y) - H(X,Y). Thus, if X and Y are independent, the mutual information is zero.
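These identities are easy to verify numerically for a small joint distribution; the sketch below (an added illustration using a hypothetical joint distribution in which a uniform input bit is flipped with probability 0.1, i.e., a binary symmetric channel) computes H(X), H(Y), H(X,Y), and I(X;Y).

```python
from math import log2

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical joint distribution: equally likely input bit X, and Y equal to X
# except flipped with probability 0.1 (a binary symmetric channel).
p_flip = 0.1
joint = {(0, 0): 0.5 * (1 - p_flip), (0, 1): 0.5 * p_flip,
         (1, 0): 0.5 * p_flip,       (1, 1): 0.5 * (1 - p_flip)}

P_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
P_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

H_X, H_Y, H_XY = H(P_x.values()), H(P_y.values()), H(joint.values())
I_XY = H_X + H_Y - H_XY

print(f"H(X)={H_X:.3f}  H(Y)={H_Y:.3f}  H(X,Y)={H_XY:.3f}  I(X;Y)={I_XY:.3f}")
```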


Capacity

It comes as no surprise that the question of ultimate limits in communication systems involves the mutual information between the transmitted sequence and the received sequence. Less obvious is that the problem of lossy compression is actually the dual of the communications problem, and relates directly to the problems of detection and source identification. The mutual information thus plays a central role in these problems as well. The problem of communicating over a single one-way communications link is depicted in Figure 2.6. A message sequence is encoded to produce the vector X^n, and then a channel produces the output sequence Y^n according to the probability transition matrix P(y|x). A decoder then estimates the original sequence. The aim is to maximize the rate of transmission while exactly reproducing the original sequence.

(Figure: Message → Encoder → X^n → Channel P(y|x) → Y^n → Decoder → Message estimate.)

Figure 2.6 Discrete communication problem

The Shannon channel capacity is then given by

C = \max_{P(x)} I(X;Y)   (2.21)

where the maximum is over all possible input distributions P(x). The effect of the channel is to produce a number of different possible images of the input sequences in the output (typically as the result of noise). For each input sequence, there will be approximately 2^{nH(Y|X)} output sequences. To avoid errors, it must be assured that no two X sequences generate the same output sequence; otherwise there will be ambiguity in decoding, rather than a clear mapping. The total number of typical Y sequences is 2^{nH(Y)}, and thus the maximum number of disjoint sets is 2^{n(H(Y)-H(Y|X))} = 2^{nI(X;Y)}. This is the maximum number of distinguishable sequences of length n that can be sent, or in other words, the capacity is I(X;Y) bits per symbol. Omitting the proof that this is indeed both achievable and the best that can be done, notice that the notion of typicality is again important. Here the number of typical output sequences per input is determined by the noise process, and thus it is the noise that fundamentally limits the throughput.

Capacity can also be defined for channels with discrete inputs but with output vectors drawn from the reals. In such cases the differential entropy is used, defined by

h(X) = -\int_{S} f(x) \log f(x)\,dx   (2.22)

where S is the support set of the random variable. An analogous set of information theoretic identities can be constructed as for the entropy. Here as well it is found that the capacity is fundamentally limited by the noise. The most important example is the Gaussian channel, in which signals of power P and bandwidth W are passed through a channel which adds white Gaussian noise of power spectral density N_0/2. Then

C = W \log\!\left(1 + \frac{P}{N_0 W}\right) \text{ bits/s}.   (2.23)

That is, capacity (not reliability) increases linearly with bandwidth (the dimensions available) and logarithmically with the signal-to-noise ratio (SNR). Viewed another way, given fixed bandwidth and data rate requirements, there is some minimum SNR that is required for reliable transmission.


Below this SNR, reliable communication is not possible at the desired transmission rate: the error rate escalates very sharply to unusable levels. The aim of channel coding is to approach this limit with reasonable computational delay and complexity. Codes have in fact been developed that very closely approach capacity.
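A small numerical illustration of (2.23) (added here; the bandwidth, noise density, and SNR values are hypothetical) shows the linear dependence on bandwidth and the logarithmic dependence on SNR.

```python
from math import log2

def capacity(W, P, N0):
    """Shannon capacity of an AWGN channel in bits/s (equation 2.23)."""
    return W * log2(1 + P / (N0 * W))

N0 = 1e-9                      # noise power spectral density (W/Hz), hypothetical
for W in (1e6, 2e6, 4e6):      # bandwidth in Hz
    for snr_db in (0, 10, 20):
        P = (10 ** (snr_db / 10)) * N0 * W    # signal power giving the stated SNR
        print(f"W={W/1e6:.0f} MHz, SNR={snr_db:2d} dB: "
              f"C={capacity(W, P, N0)/1e6:.2f} Mb/s")
```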

Example 2.10 Water-filling and multiple Gaussian channels

(Figure: noise levels N1, N2, N3 in three parallel channels; powers P1 and P3 fill channels 1 and 3 up to a common water level, while channel 2 is left unused.)

Figure 2.7 Three parallel Gaussian channels

Consider transmission over a set of Gaussian channels in which the noise power may be changing with time or may differ according to the frequency used. It might at first be tempting to increase the transmit power when the channel is poor, and decrease it when the noise is small, so as to maintain a constant information rate. However, this is the wrong approach when attempting to achieve the maximum rate. The optimal way to allocate transmit power is according to the water-filling distribution, illustrated above. Begin by pouring power into the channel with the lowest noise. When it reaches the level of the next best channel, continue pouring into both. Continue until all the power is exhausted. The depth of the power in each channel is then the power allocated to it. In Figure 2.7, the noise in channel 2 is so large that it is not used. However, if the total available power were to increase, eventually power would be allocated to all channels. In the limit of high SNR, the optimal power allocation is approximately equal over all channels. In any case, the capacity is computed using the usual formula for Gaussian channel capacity for each of the channels and summed (if they are parallel) or averaged (if they are accessed serially). A short numerical sketch of this allocation follows the next paragraph.

In network information theory (to be discussed in Chapter 8), various problems of communication among groups of communications transceivers are considered, including broadcasting, relaying of messages, multiple access, and hybrid broadcast/relay transmission schemes. As might be expected, the general problem of the capacity of a network, considering all the ways the elements may interact, is as yet unsolved, although solutions exist for a number of special cases.
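The water-filling allocation of Example 2.10 can be computed by searching for the common water level; the sketch below (an added illustration with hypothetical noise levels and total power) allocates power across three parallel unit-bandwidth channels and evaluates the resulting capacity.

```python
from math import log2

def water_fill(noise, total_power, iters=100):
    """Return per-channel powers P_i = max(0, mu - N_i) with sum P_i = total_power."""
    lo, hi = min(noise), max(noise) + total_power
    for _ in range(iters):                      # bisection on the water level mu
        mu = 0.5 * (lo + hi)
        used = sum(max(0.0, mu - n) for n in noise)
        lo, hi = (mu, hi) if used < total_power else (lo, mu)
    return [max(0.0, mu - n) for n in noise]

noise = [1.0, 8.0, 2.0]      # hypothetical noise powers N1, N2, N3
P_total = 4.0                # hypothetical total transmit power

powers = water_fill(noise, P_total)
capacity = sum(log2(1 + p / n) for p, n in zip(powers, noise))

print("allocated powers:", [round(p, 3) for p in powers])   # channel 2 gets none
print("total capacity  :", round(capacity, 3), "bits per channel use")
```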


Rate distortion coding

Quantization is an example of noisy source coding: the conversion of a sequence of random variables into a more compact representation of rate R, where there is some distortion D between the original and the reconstructed sequence (e.g., the quantization noise). The aim may be either to minimize R subject to a constraint on the maximum distortion D, or to minimize D subject to a constraint on the maximum rate R. The question naturally arises as to which (R,D) pairs are actually achievable. Suppose the distortion measure is the squared error (Y-X)^2 between the input sequence of RVs X (e.g., drawn from the reals) and the reconstructed sequence Y (drawn from some finite set). Let D be the maximum permitted distortion. The rate distortion theorem states

R(D) = \min_{f(y|x):\, E[(Y-X)^2] \le D} I(X;Y).   (2.24)

Consider the particular instance of using the mean squared error E[(Y-X)^2] as the distortion measure, and let the source generate a sequence of iid Gaussian RVs with zero mean and variance σ². Rate distortion problems are generally considered from the point of view of a test channel, that is, like a communications problem in which Y is communicated over a noisy channel that adds a random variable Z to produce the source X. For the Gaussian source, the test channel and the probability distributions of the variables are illustrated in Figure 2.8.

(Figure: Y ~ N(0, σ²-D) is added to independent noise Z ~ N(0, D) to produce X ~ N(0, σ²).)

Figure 2.8 Test channel for Gaussian source

This particular choice of the distribution produces a conditional density f(X|Y) that achieves the rate distortion bound, so that I(X;Y) = (1/2) log(σ²/D), and the rate distortion function is given by

R(D) = \frac{1}{2} \log \frac{\sigma^2}{D}, \quad 0 \le D \le \sigma^2, \text{ and } 0 \text{ otherwise}.   (2.25)

Another way of looking at this is distortion as a function of rate, D(R) = σ² 2^{-2R}. Thus, as noted earlier with respect to the discussion of quantizers, each additional bit reduces the distortion by a factor of 4, or 6 dB. This is in keeping with the general trend for quantizers that the distortion in dB is of the form D(n) = k - 6n dB, where n is the number of bits. Because the optimal rate distortion coder operates on sequences rather than performing operations symbol by symbol, it achieves a lower value for the constant k. Vector quantizers can closely approach the rate distortion bound.

In lossy source coding, there are in practice generally a number of steps. First some heuristic is applied to remove most of the correlation of the source, using knowledge of the source statistics and, in the case of video or audio signals, knowledge of which features have the largest perceptual impact. Then rate-distortion coding or lossless source coding is applied to remove the remaining correlation in the sequence.

In sensor networks, it is generally not feasible, for reasons of network scalability or energy consumption, to send all the data collected by sensors to some central location for processing. Thus sensor nodes must perform some form of lossy compression. In the extreme, decisions must be made locally, for example to identify a source and thus describe it by just a few bits. A research question is the trade between network resource consumption (energy, bandwidth) and the fidelity of the final decisions fused from the (processed) observations of a set of nodes. The former may be identified with rate, and the latter with the distortion, as a network rate distortion problem.
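The 6 dB-per-bit rule is immediate from D(R) = σ² 2^{-2R}; the sketch below (an added illustration for a unit-variance Gaussian source) tabulates the rate distortion limit for a few rates.

```python
from math import log10

sigma2 = 1.0                      # source variance (unit-variance Gaussian source)

for R in range(1, 6):             # rate in bits per sample
    D = sigma2 * 2 ** (-2 * R)    # distortion at the rate distortion limit
    print(f"R={R} bits: D={D:.5f}  ({10 * log10(D / sigma2):6.2f} dB)")
```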


Data processing inequality

Another information theoretic result of considerable importance in the design of receivers (whether for sensing or communications) is the data processing inequality. Consider the information processing system depicted in Figure 2.9.

(Figure: X → P(Y|X) → Y → P(Z|Y) → Z.)

Figure 2.9 An information processing system

Here a source X passes through a block (e.g., a channel) which produces Y according to the conditional distribution P(Y|X). A processing step produces Z according to the conditional distribution P(Z|Y). Consequently the sequence X to Y to Z forms a Markov chain. In this situation, the data processing inequality (proved using the chain rule for mutual information) states that

I(X;Y) \ge I(X;Z).

That is, processing cannot increase the mutual information between the received sequence Y and the source sequence X; it can at best keep it constant, and it may destroy information. This is also true in the particular case that Z = g(Y), where g is an arbitrary function. One does not get something from nothing through the magic of computer processing; all the information is in the original signal, and it is likely all downhill from there. What processing can do is apply information-preserving transformations that present the data in a more convenient form. In an optimal receiver, the processing steps should thus be designed to preserve the mutual information between the end result and the original source. If a more compact representation is required due to constraints, then a reasonable goal is to maximize the mutual information subject to rate or distortion constraints.

2.5 Summary

This chapter has laid out the basic description of signals as random processes. The interface between continuous-time signals and representations suitable for computers was described. The sampling theorem indicates the minimum rate at which a random process must be sampled in order for the possibility of accurate signal reconstruction to be preserved. Information theory lays out fundamental limits on the compressibility of the resulting data stream, and the minimum error that results in the transformation to a set of discrete values. It further supplies fundamental limits on the resources required for communication of data.

2.6 To inquire further


Stochastic processes and linear systems are treated at an introductory level but with sufficient mathematical rigor in S. Haykin, Communication Systems, 4th ed., Wiley, 2001. A standard comprehensive reference for probability is A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw-Hill, 1965. The discussion on information theory was abstracted from the highly readable textbook T.M. Cover and J.A. Thomas, Elements of Information Theory, Wiley, 1991; it uses an economy of mathematical techniques while discussing a very broad range of topics in information theory, from the basics to gambling on the stock market. A very accessible overview of the implications of information theory is found in R.W. Lucky, Silicon Dreams: Information, Man, and Machine, St. Martin's Press, 1989. An excellent book covering much of the content of this chapter in more detail (but sadly out of print) is P.M. Woodward, Probability and Information Theory, with Applications to Radar, Pergamon Press, 1953. Of particular note is the connection drawn between maximization of mutual information and optimal receiver design. The sampling theorem as stated here in the context of the Fourier transform does not suggest efficient procedures for dealing with discontinuities (e.g., edges, as arise in images). Theory can alternatively be constructed for other transforms or on the basis of the information content of a random process. See for example A. Ridolfi, I. Maravic, J. Kusuma, and M. Vetterli, "A Bernoulli-Gaussian approach to the reconstruction of noisy signals with finite rate of innovation," submitted to IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2004.

2.7 Problems

2.1 The Rayleigh Distribution
Let z be Rayleigh distributed with pdf

f(z) = \frac{z}{\sigma^2} e^{-z^2/2\sigma^2}, \quad z > 0.

Find the mean and the variance of z.

2.2 The Binary Symmetric Channel
Consider the binary symmetric channel (BSC), which is shown in Figure 2.10. This is a binary channel in which the input symbols are complemented with probability p. The probability associated with each input-output pair is also shown in the figure; for example, the probability that input 0 is received as output 0 is 1-p. Compute the channel capacity.


(Figure: input 0 → output 0 with probability 1-p and output 1 with probability p; input 1 → output 1 with probability 1-p and output 0 with probability p.)

Figure 2.10 The binary symmetric channel

2.3 The binary entropy function
Let X = 1 with probability p and X = 0 with probability 1-p. Plot H(X) as a function of p. Prove that the maximum occurs when the likelihoods of the two outcomes are equal.

2.4 The Max quantizer
Let X ~ N(0, σ²) (that is, X takes on a normal distribution) and let the distortion measure be the squared error. Consider a 1-bit Max quantizer for this Gaussian source. One bit is used to represent the random variable X: if the random variable is greater than or equal to 0, the value is represented by the digital 1; otherwise, it is represented by the digital 0. The 1 and 0 correspond to two analog reproducer values used to produce the reconstruction X̂(X). Determine these two analog values such that the mean square error E[(X - X̂(X))²] is minimized. Also compute the expected distortion for this quantizer, and then compare this with the rate distortion limit.

2.5 The binomial distribution
Compute the mean and standard deviation of the binomial distribution shown in Example 2.1.

2.6 Comparison of Gaussian and binomial distributions
To demonstrate the similarity between the binomial and the Gaussian distributions, consider a binomial distribution with n = 150 and p = 0.5. First, plot the probability density functions of both distributions with the same variance. Also plot the absolute value of the difference between the two distributions, both without and with normalization to the values of the binomial distribution. Finally, for the binomial distribution compute the probability that the trial outcome is within one standard deviation of the mean.

2.7 The lognormal distribution
Let X ~ N(μ, σ²) and Y = 10^{X/10}, so that Y is lognormally distributed. Compute the mean and the variance of this lognormal distribution.



Hint: It is suggested to solve this problem using moment generating functions. The moment generating function of X is defined by M_X(t) = E[e^{tX}]. Since Y can be expressed as an exponential of X, substituting specific values of t yields the mean and the variance. The generating function is also a powerful tool for other distributions.

2.8 The cumulative distribution function
In addition to the probability density function (pdf), another function that is commonly used is the probability distribution function, also called the cumulative distribution function (cdf). It is defined as F(x) = P(X ≤ x), -∞ < x < ∞.


Compute the Fourier transform of the following signals without resorting to the Fourier transform pair table:

(a) rect(t/2T1) = 1 for |t| < T1, 0 for |t| > T1
(b) δ(t)
(c) e^{-at} u(t), Re{a} > 0

2.12 Some more challenging transforms
Compute the Fourier transform of each of the following signals, using the results of the previous problem and the properties described in this chapter.

(a) cos(2πf0 t)
(b) The periodic square wave x(t) = 1 for |t| < T1 and 0 for T1 < |t| ≤ T0/2, with x(t + T0) = x(t)


    In this problem, A/D and D/A conversion are practiced and two simulations are compared. Consider the steps in Figure 2.11

(Figure: x(t) → h1(t) → quantizer → binary labels → reproducer → h2(t) → y(t).)

    Figure 2.11 A/D and D/A conversion

The quantizer is the 3-bit uniform quantizer shown in Figure 2.4. The reproducer is a D/A converter: it converts the binary labels into analog values and holds each analog value until the next binary label is available. h1(t) and h2(t) are both low-pass filters, where h1(t) is called the anti-aliasing filter. The transfer function of both filters is a third-order Butterworth low-pass filter:

H_1(s) = H_2(s) = \frac{1.984 \times 10^{12}}{s^3 + 2.513 \times 10^4 s^2 + 3.158 \times 10^8 s + 1.984 \times 10^{12}}

The input is a triangle wave, varying between -4 and 4 with period T0, as depicted in Figure 2.12.

    Figure 2.12 A triangle wave

T0 = 1/f0 = 1 ms. The sample rate is 20,000 samples/s. In the first experiment, compute the mean square error. Notice that the A/D converter has no information about the input phase. In the second experiment, the anti-aliasing filter h1(t) is removed; compute the mean square error in this situation.