
EE 278 Lecture Notes #3, Winter 2010–2011

Random variables, vectors, and processes

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 © R.M. Gray 2011

Random Variables

Probability space (Ω, ℱ, P)

A (real-valued) random variable is a real-valued function defined on Ω with a technical condition (to be stated)

Common to use upper-case letters. E.g., a random variable X is a function X : Ω → R. Other typical names: Y, Z, U, V, Θ, ...

Also common: random variable may take on values only in some subset Ω_X ⊂ R (sometimes called the alphabet of X; A_X and 𝒳 are also common notations)

Intuition: Randomness is in experiment, which produces outcome ω according to probability P ⇒ random variable outcome is X(ω) ∈ Ω_X ⊂ R.


Examples

Consider (Ω, ℱ, P) with Ω = R, P determined by uniform pdf on [0, 1)

Coin flip from earlier: X : R → {0, 1} by

$$X(r) = \begin{cases} 0 & \text{if } r \le 0.5 \\ 1 & \text{otherwise} \end{cases}$$

Observe X, do not observe outcome of fair spin.

Lots of possible random variables, e.g., W(r) = r^2, Z(r) = e^r, V(r) = r, L(r) = −r ln r (require r ≥ 0), Y(r) = cos(2πr), etc.

Can think of rvs as observations or measurements made on an underlying experiment.


Functions of random variables

Suppose that X is a rv defined on (Ω, ℱ, P) and suppose that g : Ω_X → R is another real-valued function.

Then the function g(X) : Ω → R defined by g(X)(ω) = g(X(ω)) is also a real-valued mapping of Ω, i.e., a real-valued function of a random variable is a random variable

Can express the previous examples as W = V^2, Z = e^V, L = −V ln V, Y = cos(2πV)

Similarly, 1/W, sinh(Y), L^3 are all random variables


Random vectors and random processes

A finite collection of random variables (defined on a common probability space (Ω, ℱ, P)) is a random vector

E.g., (X, Y), (X_0, X_1, ..., X_{k−1})

An infinite collection of random variables (defined on a common probability space) is a random process

E.g., {X_n; n = 0, 1, 2, ...}, {X(t); t ∈ (−∞, ∞)}

So theory of random vectors and random processes mostly boils down to theory of random variables.


Derived distributions

In general: “input” probability space (Ω, ℱ, P) + random variable X ⇒ “output” probability space, say (Ω_X, B(Ω_X), P_X), where Ω_X ⊂ R and P_X is the distribution of X:

P_X(F) = Pr(X ∈ F)

Typically P_X described by pmf p_X or pdf f_X

For the binary quantizer special case, we already derived P_X.

Idea generalizes and forces a technical condition on definition of random variable (and hence also on random vector and random process)


Inverse image formula

Given (Ω, B(Ω), P) and a random variable X, find P_X

Basic method: P_X(F) = the probability computed using P of all the original sample points that are mapped by X into the subset F:

P_X(F) = P({ω : X(ω) ∈ F})

Shorthand way to write formula in terms of inverse image of an event F ∈ B(Ω_X) under the mapping X : Ω → Ω_X, where X^{-1}(F) = {r : X(r) ∈ F}:

P_X(F) = P(X^{-1}(F))

Written informally as P_X(F) = Pr(X ∈ F) = P{X ∈ F} = “probability that random variable X assumes a value in F”

[Figure: the inverse image X^{-1}(F) is mapped by X into the event F]

Inverse image method: Pr(X ∈ F) = P({ω : X(ω) ∈ F}) = P(X^{-1}(F))

Inverse image formula: fundamental to probability, random processes, signal processing.

Shows how to compute probabilities of output events in terms of the input probability space. Does the definition make sense?

I.e., is P_X(F) = P(X^{-1}(F)) well-defined for all output events F?

Yes, if we include a requirement in the definition of random variable:


Careful definition of a random variable

Given a probability space (Ω, ℱ, P), a (real-valued) random variable X is a function X : Ω → Ω_X ⊂ R with the property that

if F ∈ B(Ω_X), then X^{-1}(F) ∈ ℱ

Notes:

• In English: X : Ω → Ω_X ⊂ R is a random variable iff the inverse image of every output event is an input event, and therefore P_X(F) = P(X^{-1}(F)) is well-defined for all events F.

• Another name for a function with this property: measurable function

• Nearly every function we encounter is measurable, but the calculus of probability rests on this property, and advanced courses prove measurability of important functions.

In the simple binary quantizer example, X is measurable (easy to show since ℱ = B([0, 1)) contains intervals). Recall

P_X({0}) = P({r : X(r) = 0}) = P(X^{-1}({0})) = P({r : 0 ≤ r ≤ 0.5}) = P([0, 0.5]) = 0.5

P_X({1}) = P(X^{-1}({1})) = P((0.5, 1)) = 0.5

P_X(Ω_X) = P_X({0, 1}) = P(X^{-1}({0, 1})) = P([0, 1)) = 1

P_X(∅) = P(X^{-1}(∅)) = P(∅) = 0

In general, find P_X by computing pmf or pdf, as appropriate. Many shortcuts, but basic approach is inverse image formula.
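
The inverse image computation for the binary quantizer can be checked empirically. The following is a minimal sketch, assuming Python's standard `random` module as a stand-in for the fair spinner; the function names, sample size, and seed are illustrative choices and not part of the notes.

```python
import random

def quantizer(r: float) -> int:
    """Binary quantizer X(r): 0 if r <= 0.5, else 1."""
    return 0 if r <= 0.5 else 1

def estimate_pmf(num_samples: int = 100_000, seed: int = 0) -> dict:
    """Estimate P_X({0}) and P_X({1}) as relative frequencies."""
    rng = random.Random(seed)          # seeded for reproducibility
    counts = {0: 0, 1: 0}
    for _ in range(num_samples):
        r = rng.random()               # sample point, uniform on [0, 1)
        counts[quantizer(r)] += 1      # r lies in the inverse image of X(r)
    return {x: c / num_samples for x, c in counts.items()}

pmf_estimate = estimate_pmf()
```

Each relative frequency estimates the probability of an inverse image, so both entries of `pmf_estimate` should be close to 0.5.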


Random vectors

All theory, calculus, applications of individual random variables are useful for studying random vectors and random processes, since random vectors and processes are simply collections of random variables.

One k-dimensional random vector = k 1-dimensional random variables defined on a common probability space.

Earlier example: two coin flips, k coin flips (first k binary coefficients of fair spinner)

Several notations used, e.g., X^k = (X_0, X_1, ..., X_{k−1}) is shorthand for X^k(ω) = (X_0(ω), X_1(ω), ..., X_{k−1}(ω))

or X or {X_n; n = 0, 1, ..., k − 1} or {X_n; n ∈ Z_k}

Can be discrete (described by multidimensional pmf) or continuous (e.g., described by multidimensional pdf) or mixed

Recall that a real-valued function of a random variable is a random variable.

Similarly, a real-valued function of a random vector (several random variables) is a random variable. E.g., if X_0, X_1, ..., X_{n−1} are random variables, then

$$S_n = \frac{1}{n} \sum_{k=0}^{n-1} X_k$$

is a random variable defined by

$$S_n(\omega) = \frac{1}{n} \sum_{k=0}^{n-1} X_k(\omega)$$


Inverse image formula for random vectors

$$P_X(F) = P(X^{-1}(F)) = P(\{\omega : X(\omega) \in F\}) = P(\{\omega : (X_0(\omega), X_1(\omega), \ldots, X_{k-1}(\omega)) \in F\})$$

where the various forms are equivalent and all stand for Pr(X ∈ F)

Technically, the formula holds for suitable events F ∈ B(R)^k, the Borel field of R^k (or some suitable subset). See book for discussion.

One multidimensional event of particular interest is a Cartesian product of 1D events (called a rectangle):

$$F = \mathop{\times}_{i=0}^{k-1} F_i = \{x^k : x_i \in F_i;\ i = 0, \ldots, k-1\}$$

$$P_X(F) = P(\{\omega : X_0(\omega) \in F_0, X_1(\omega) \in F_1, \ldots, X_{k-1}(\omega) \in F_{k-1}\})$$

Random processes

A random vector is a finite collection of rvs defined on a common probability space

A random process is an infinite family of rvs defined on a common probability space. Many types:

{X_n; n = 0, 1, 2, ...} (discrete-time, one-sided)

{X_n; n ∈ Z} (discrete-time, two-sided)

{X_t; t ∈ [0, ∞)} (continuous-time, one-sided)

{X_t; t ∈ R} (continuous-time, two-sided)

Also called stochastic process


In general: {X_t; t ∈ T} or {X(t); t ∈ T}

Other notations: {X(t)}, {X[n]} (for discrete-time)

Sloppy but common: X(t), where context tells us it is a random process and not a single rv

Discrete-time random processes are also called time series

Always: a random process is an indexed family of random variables, T is index set

For each t, X_t is a random variable. All X_t defined on a common probability space

Index is usually time; in some applications it is space, e.g., random field {X(t, s); t, s ∈ [0, 1)} models a random image, {V(x, y, t); x, y ∈ [0, 1); t ∈ [0, ∞)} models analog video.


Keep in mind the suppressed argument ω: e.g., each X_t is X_t(ω), a function defined on the sample space

X(t) is X(t, ω); it can be viewed as a function of two arguments

Have seen one example: fair coin flips, a Bernoulli random process

Another, simpler, example:

Random sinusoids: Suppose that A and Θ are two random variables with a joint pdf f_{A,Θ}(a, θ) = f_A(a) f_Θ(θ). For example, Θ ∼ U([0, 2π)) and A ∼ N(0, σ^2). Define a continuous-time random process X(t) for all t ∈ R

X(t) = A cos(2πt + Θ)

Or, making the dependence on ω explicit,

X(t,ω) = A(ω) cos(2πt + Θ(ω))


Derived distributions for random variables

General problem: Given probability space (Ω, ℱ, P) and a random variable X with range space (alphabet) Ω_X. Find the distribution P_X.

If Ω_X is discrete, then P_X described by a pmf

$$p_X(x) = P(X^{-1}(\{x\})) = P(\{\omega : X(\omega) = x\})$$

$$P_X(F) = \sum_{x \in F} p_X(x) = P(X^{-1}(F))$$

If Ω_X is continuous, then need a pdf.

But a pdf is not a probability, so inverse image formula does not apply immediately ⇒ alter approach


Cumulative distribution functions

Define cumulative distribution function (cdf) by

$$F_X(x) \equiv \int_{-\infty}^{x} f_X(r)\,dr = \Pr(X \le x)$$

This is a probability and inverse image formula works

$$F_X(x) = P(X^{-1}((-\infty, x]))$$

and from calculus

$$f_X(x) = \frac{d}{dx} F_X(x)$$

So first find cdf F_X(x), then differentiate to find f_X(x)


Notes:

• If b ≥ a, then since (−∞, b] = (−∞, a] ∪ (a, b] is the union of disjoint intervals, F_X(b) = F_X(a) + P_X((a, b]) and hence

$$P_X((a, b]) = \int_a^b f_X(x)\,dx = F_X(b) - F_X(a)$$

⇒ F_X(x) is monotonically nondecreasing

• cdf is well defined for discrete rvs:

$$F_X(r) = \Pr(X \le r) = \sum_{x : x \le r} p_X(x),$$

but not as useful. Not needed for derived distributions


If original space (Ω, ℱ, P) is a discrete probability space, then rv X defined on (Ω, ℱ, P) is also discrete

Inverse image formula ⇒

$$p_X(x) = P_X(\{x\}) = P(X^{-1}(\{x\})) = \sum_{\omega : X(\omega) = x} p(\omega)$$


Example: discrete derived distribution

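Ω = Z_+ = {1, 2, ...}, P determined by the geometric pmf p(k) = (1 − p)^{k−1} p; k = 1, 2, ...

Define a random variable Y:

$$Y(\omega) = \begin{cases} 1 & \text{if } \omega \text{ even} \\ 0 & \text{if } \omega \text{ odd} \end{cases}$$

Using the inverse image formula for the pmf for Y(ω) = 1:

$$\begin{aligned}
p_Y(1) &= \sum_{k : k \text{ even}} (1-p)^{k-1} p = \sum_{k=2,4,\ldots} (1-p)^{k-1} p \\
&= \frac{p}{1-p} \sum_{k=1}^{\infty} \left((1-p)^2\right)^k = p(1-p) \sum_{k=0}^{\infty} \left((1-p)^2\right)^k \\
&= \frac{p(1-p)}{1 - (1-p)^2} = \frac{1-p}{2-p}
\end{aligned}$$

$$p_Y(0) = 1 - p_Y(1) = \frac{1}{2-p}$$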
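
The closed form for the even/odd example can be compared against a direct sum over the inverse images. This is a minimal sketch; the parameter value p = 0.3 and the truncation of the infinite sum at 2000 terms are illustrative assumptions.

```python
def geometric_pmf(k: int, p: float) -> float:
    """Geometric pmf on {1, 2, ...}: (1 - p)^(k - 1) * p."""
    return (1.0 - p) ** (k - 1) * p

def pmf_of_Y(p: float, terms: int = 2000) -> dict:
    """Derive p_Y by summing p(omega) over each inverse image (truncated sum)."""
    p_even = sum(geometric_pmf(k, p) for k in range(2, terms, 2))  # Y = 1
    p_odd = sum(geometric_pmf(k, p) for k in range(1, terms, 2))   # Y = 0
    return {1: p_even, 0: p_odd}

p = 0.3                                   # hypothetical parameter value
derived = pmf_of_Y(p)
closed_form = {1: (1 - p) / (2 - p), 0: 1 / (2 - p)}
```

The entries of `derived` should agree with `closed_form` up to the (negligible) truncation error of the series.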


Suppose original space is (Ω, ℱ, P) = (R, B(R), P) where P is described by a pdf g:

$$P(F) = \int_{r \in F} g(r)\,dr; \quad F \in \mathcal{B}(\mathbb{R}).$$

X a rv. Inverse image formula ⇒

$$P_X(F) = P(X^{-1}(F)) = \int_{r : X(r) \in F} g(r)\,dr.$$

If X discrete, find the pmf

$$p_X(x) = \int_{r : X(r) = x} g(r)\,dr$$

Quantizer example did this.

If X is continuous, want the pdf. First find cdf then differentiate.


Example: continuous derived distribution

Square of a random variable

(R,B(R), P) with P induced by a Gaussian pdf.

Define W : R → R by W(r) = r^2; r ∈ R.

Find pdf f_W. First find cdf F_W, then differentiate. If w < 0, F_W(w) = 0. If w ≥ 0,

$$F_W(w) = \Pr(W \le w) = P(\{\omega : W(\omega) = \omega^2 \le w\}) = P([-w^{1/2}, w^{1/2}]) = \int_{-w^{1/2}}^{w^{1/2}} g(r)\,dr$$

This can be complicated, but don't need to plug in g yet


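Use integral differentiation formula to get pdf directly:

$$\frac{d}{dw} \int_{a(w)}^{b(w)} g(r)\,dr = g(b(w)) \frac{db(w)}{dw} - g(a(w)) \frac{da(w)}{dw}$$

In our example

$$f_W(w) = g(w^{1/2}) \frac{w^{-1/2}}{2} - g(-w^{1/2}) \left(-\frac{w^{-1/2}}{2}\right)$$

E.g., if g = N(0, σ^2), then

$$f_W(w) = \frac{w^{-1/2}}{\sqrt{2\pi\sigma^2}} e^{-w/(2\sigma^2)}; \quad w \in [0, \infty),$$

a chi-squared pdf with one degree of freedom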
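
As a sanity check on the "find the cdf, then differentiate" recipe, the sketch below differentiates the cdf of W numerically and compares it with the closed-form pdf. It assumes the Gaussian case g = N(0, σ^2), for which the cdf can be written with the standard library's `math.erf`; the value of σ and the test points are illustrative.

```python
import math

def cdf_W(w: float, sigma: float) -> float:
    """F_W(w) = P([-sqrt(w), sqrt(w)]) when the input pdf is N(0, sigma^2)."""
    if w < 0:
        return 0.0
    return math.erf(math.sqrt(w) / (sigma * math.sqrt(2.0)))

def pdf_W(w: float, sigma: float) -> float:
    """Closed form: w^(-1/2) exp(-w / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return w ** -0.5 * math.exp(-w / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def numeric_derivative(f, w: float, h: float = 1e-6) -> float:
    """Central-difference approximation to f'(w)."""
    return (f(w + h) - f(w - h)) / (2 * h)

sigma = 1.5                                # hypothetical standard deviation
checks = [(w, numeric_derivative(lambda u: cdf_W(u, sigma), w), pdf_W(w, sigma))
          for w in (0.5, 1.0, 4.0)]
```

Each tuple in `checks` holds a point w, the numerical derivative of the cdf there, and the closed-form pdf value; the last two should agree closely.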


Example: continuous derived distribution

The max and min functions

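Let X ∼ f_X(x) and Y ∼ f_Y(y) be independent so that f_{X,Y}(x, y) = f_X(x) f_Y(y).

Define U = max{X, Y}, V = min{X, Y}

where

$$\max(x, y) = \begin{cases} x & \text{if } x \ge y \\ y & \text{otherwise} \end{cases} \qquad \min(x, y) = \begin{cases} y & \text{if } x \ge y \\ x & \text{otherwise} \end{cases}$$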


Find the pdfs of U and V.

To find the pdf of U, we first find its cdf. U ≤ u iff both X and Y are ≤ u, so using independence

F_U(u) = Pr(U ≤ u) = Pr(X ≤ u, Y ≤ u) = F_X(u) F_Y(u)

Using the product rule for derivatives,

f_U(u) = f_X(u) F_Y(u) + f_Y(u) F_X(u)

To find the pdf of V, first find the cdf. V ≤ v iff either X or Y is ≤ v, so that using independence

F_V(v) = Pr(X ≤ v or Y ≤ v)

= 1 − Pr(X > v, Y > v)

= 1 − (1 − F_X(v))(1 − F_Y(v))


Thus f_V(v) = f_X(v) + f_Y(v) − f_X(v) F_Y(v) − f_Y(v) F_X(v)
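
The max and min formulas can be tried on a case with a known answer. The sketch below assumes X and Y are independent exponential random variables (an illustrative choice, not from the notes); the minimum of independent exponentials with rates a and b is known to be exponential with rate a + b, which gives an independent check of the f_V formula.

```python
import math

def exp_pdf(x: float, rate: float) -> float:
    """Exponential pdf with the given rate."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def exp_cdf(x: float, rate: float) -> float:
    """Exponential cdf with the given rate."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

def pdf_max(u: float, rx: float, ry: float) -> float:
    """f_U(u) = f_X(u) F_Y(u) + f_Y(u) F_X(u)."""
    return exp_pdf(u, rx) * exp_cdf(u, ry) + exp_pdf(u, ry) * exp_cdf(u, rx)

def pdf_min(v: float, rx: float, ry: float) -> float:
    """f_V(v) = f_X(v) + f_Y(v) - f_X(v) F_Y(v) - f_Y(v) F_X(v)."""
    return (exp_pdf(v, rx) + exp_pdf(v, ry)
            - exp_pdf(v, rx) * exp_cdf(v, ry) - exp_pdf(v, ry) * exp_cdf(v, rx))

rate_x, rate_y = 1.0, 2.0                  # hypothetical rates
v = 0.7
min_from_formula = pdf_min(v, rate_x, rate_y)
min_known = exp_pdf(v, rate_x + rate_y)    # min of independent exponentials
```

Note also that the two formulas always satisfy f_U + f_V = f_X + f_Y, which is a quick consistency check for any choice of pdfs.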


Directly-given random variables

All named examples of pmfs (uniform, Bernoulli, binomial, geometric, Poisson) and pdfs (uniform, exponential, Gaussian, Laplacian, chi-squared, etc.) and the probability spaces they imply can be considered as describing random variables:

Suppose (Ω, ℱ, P) is a probability space with Ω ⊂ R.

Define a random variable V : Ω → Ω by

V(ω) = ω

the identity mapping; the random variable just reports the original sample value ω


Implies output probability space in trivial way:

P_V(F) = P(V^{-1}(F)) = P(F)

If original space discrete (continuous), so is random variable, and random variable is described by pmf (pdf)

A random variable is said to be Bernoulli, binomial, etc. if its distribution is determined by a Bernoulli, binomial, etc. pmf (or pdf)

Two random variables V and X (possibly defined on different experiments) are said to be equivalent or identically distributed if P_V = P_X, i.e., P_V(F) = P_X(F) for all events F

E.g., both continuous with same pdf, or both discrete with same pmf

Example: Binary random variable defined as quantization of fair spinner vs. directly given as above.


Note: Two ways to describe random variables:

1. Describe a probability space (Ω, ℱ, P) and define a function X on it. Together these imply distribution P_X for rv (by a pmf or pdf)

2. (Directly given) Describe distribution P_X directly (by a pmf or pdf). Implicitly (Ω, ℱ, P) = (Ω_X, B(Ω_X), P_X) and X(ω) = ω.

Both representations are useful.

Both representations are useful.


Derived distributions: random vectors

As in the scalar case, distribution can be described by probability functions: cdfs and either pmfs or pdfs (or both)

If random vector has a discrete range space, then the distribution can be described by a multidimensional pmf p_X(x) = P_X({x}) = Pr(X = x) as

$$P_X(F) = \sum_{x \in F} p_X(x) = \sum_{(x_0, x_1, \ldots, x_{k-1}) \in F} p_{X_0, X_1, \ldots, X_{k-1}}(x_0, x_1, \ldots, x_{k-1})$$

If the random vector X has a continuous range space, then distribution can be described by a multidimensional pdf f_X:

$$P_X(F) = \int_F f_X(x)\,dx$$

Use multidimensional cdf to find pdf


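Given a k-dimensional random vector X, define cumulative distribution function (cdf) F_X by

$$\begin{aligned}
F_X(x) &= F_{X_0, X_1, \ldots, X_{k-1}}(x_0, x_1, \ldots, x_{k-1}) \\
&= P_X(\{\alpha : \alpha_i \le x_i;\ i = 0, 1, \ldots, k-1\}) \\
&= \Pr(X_i \le x_i;\ i = 0, 1, \ldots, k-1) \\
&= \int_{-\infty}^{x_0} \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_{k-1}} f_{X_0, X_1, \ldots, X_{k-1}}(\alpha_0, \alpha_1, \ldots, \alpha_{k-1})\,d\alpha_0\,d\alpha_1 \cdots d\alpha_{k-1}
\end{aligned}$$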


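Other ways to express multidimensional cdf:

$$\begin{aligned}
F_X(x) &= P_X\left(\mathop{\times}_{i=0}^{k-1} (-\infty, x_i]\right) \\
&= P(\{\omega : X_i(\omega) \le x_i;\ i = 0, 1, \ldots, k-1\}) \\
&= P\left(\bigcap_{i=0}^{k-1} X_i^{-1}((-\infty, x_i])\right).
\end{aligned}$$

Integration and differentiation are inverses of each other ⇒

$$f_{X_0, X_1, \ldots, X_{k-1}}(x_0, x_1, \ldots, x_{k-1}) = \frac{\partial^k}{\partial x_0\, \partial x_1 \cdots \partial x_{k-1}} F_{X_0, X_1, \ldots, X_{k-1}}(x_0, x_1, \ldots, x_{k-1}).$$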


Joint and marginal distributions

Random vector X = (X_0, X_1, ..., X_{k−1}) is a collection of random variables defined on a common probability space (Ω, ℱ, P)

Alternatively, X is a random vector that takes on values randomly as described by a probability distribution P_X, without explicit reference to the underlying probability space.

Either the original probability measure P or the induced distribution P_X can be used to compute probabilities of events involving the random vector.

E.g., finding the distributions of individual components of the random vector.


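For example, if X = (X_0, X_1, ..., X_{k−1}) is discrete, described by a pmf p_X, then the distribution P_{X_0} is described by the pmf p_{X_0}(x_0), which can be computed as

$$\begin{aligned}
p_{X_0}(x_0) &= P(\{\omega : X_0(\omega) = x_0\}) \\
&= P(\{\omega : X_0(\omega) = x_0,\ X_i(\omega) \in \Omega_X;\ i = 1, 2, \ldots, k-1\}) \\
&= \sum_{x_1, x_2, \ldots, x_{k-1}} p_X(x_0, x_1, x_2, \ldots, x_{k-1})
\end{aligned}$$

In English, all of these are Pr(X_0 = x_0)

In general we have for cdfs that

$$\begin{aligned}
F_{X_0}(x_0) &= P(\{\omega : X_0(\omega) \le x_0\}) \\
&= P(\{\omega : X_0(\omega) \le x_0,\ X_i(\omega) \in \Omega_X;\ i = 1, 2, \ldots, k-1\}) \\
&= F_X(x_0, \infty, \infty, \ldots, \infty)
\end{aligned}$$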


⇒ if the pdfs exist,

$$f_{X_0}(x_0) = \int f_X(x_0, x_1, x_2, \ldots, x_{k-1})\,dx_1\,dx_2 \cdots dx_{k-1}$$

Can find distributions for any of the components in this way:

$$p_{X_i}(\alpha) = \sum_{x_0, x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{k-1}} p_{X_0, X_1, \ldots, X_{k-1}}(x_0, x_1, \ldots, x_{i-1}, \alpha, x_{i+1}, \ldots, x_{k-1})$$

or

$$f_{X_i}(\alpha) = \int dx_0 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_{k-1}\, f_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{i-1}, \alpha, x_{i+1}, \ldots, x_{k-1})$$


Sum or integrate over all of the dummy variables corresponding to the unwanted random variables in the vector to obtain the pmf or pdf for the random variable X_i

F_{X_i}(α) = F_X(∞, ∞, ..., ∞, α, ∞, ..., ∞), or Pr(X_i ≤ α) = Pr(X_i ≤ α and X_j ≤ ∞, all j ≠ i)

Similarly can find cdfs/pmfs/pdfs for any pairs or triples of random variables in the random vector or any other subvector (at least in theory)

These relations are called consistency relationships: a random vector distribution implies many other distributions, and these must be consistent with each other.


2D random vectors

Ideas are clearest when only 2 rvs: (X, Y) a random vector.

Marginal distribution of X is obtained from the joint distribution of X and Y by leaving Y unconstrained

P_X(F) = P_{X,Y}({(x, y) : x ∈ F, y ∈ R}); F ∈ B(R).

Marginal cdf of X is F_X(α) = F_{X,Y}(α, ∞)

If the range space of the vector (X, Y) is discrete,

$$p_X(x) = \sum_y p_{X,Y}(x, y).$$


If the range space of the vector (X, Y) is continuous and the cdf is differentiable so that f_{X,Y}(x, y) exists,

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy,$$

with similar expressions for the distribution for rv Y.

Joint distributions imply marginal distributions.

The opposite is not true without additional assumptions, e.g., independence.


Examples of joint and marginal distributions

Example

Suppose rvs X and Y are such that the random vector (X, Y) has a pmf of the form

p_{X,Y}(x, y) = r(x) q(y),

where r and q are both valid pmfs. (p_{X,Y} is a product pmf)

Then

$$p_X(x) = \sum_y p_{X,Y}(x, y) = \sum_y r(x) q(y) = r(x) \sum_y q(y) = r(x).$$


Thus in the special case of a product distribution, knowing the marginal pmfs is enough to know the joint distribution. Thus marginal distributions + independence ⇒ the joint distribution.

Pair of fair coins provides an example:

$$p_{XY}(x, y) = p_X(x) p_Y(y) = \frac{1}{4}; \quad x, y = 0, 1$$

$$p_X(x) = p_Y(y) = \frac{1}{2}; \quad x = 0, 1$$


Example of where marginals not enough

Flip two fair coins connected by a piece of flexible rubber

Joint pmf p_{XY}(x, y):

          y = 0   y = 1
x = 0      0.4     0.1
x = 1      0.1     0.4

⇒ p_X(x) = p_Y(y) = 1/2, x = 0, 1

Not a product distribution, but same marginals as product distribution case

Quite different joints can yield the same marginals. Marginals alone do not tell the story.
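
The two coin examples can be compared side by side by summing out each coordinate of the joint pmf. This is a minimal sketch; representing a joint pmf as a dictionary keyed by (x, y) is an illustrative choice.

```python
def marginals(joint: dict) -> tuple:
    """Return (p_X, p_Y) from a joint pmf given as {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), prob in joint.items():
        p_x[x] = p_x.get(x, 0.0) + prob   # sum over y
        p_y[y] = p_y.get(y, 0.0) + prob   # sum over x
    return p_x, p_y

independent_coins = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
coupled_coins = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

px_ind, py_ind = marginals(independent_coins)
px_cpl, py_cpl = marginals(coupled_coins)
```

Both joint pmfs produce marginals equal to 1/2 for each outcome, even though the joints themselves differ.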


Another example

Loaded pair of six-sided dice have the property that the sum of the two dice = 7 on every roll.

All 6 possible combinations ((1,6), (2,5), (3,4), (4,3), (5,2), (6,1)) have equal probability.

Suppose outcome of one die is X, other is Y

(X, Y) is a random vector taking values in {1, 2, ..., 6}^2

$$p_{X,Y}(x, y) = \frac{1}{6}, \quad x + y = 7,\ (x, y) \in \{1, 2, \ldots, 6\}^2.$$

Find marginal pmfs


$$p_X(x) = \sum_y p_{XY}(x, y) = p_{XY}(x, 7 - x) = \frac{1}{6}, \quad x = 1, 2, \ldots, 6$$

Same as if product distribution. Marginals alone do not imply joint


Continuous example

(X, Y) a random vector with a pdf that is constant on the unit disk in the XY plane:

$$f_{X,Y}(x, y) = \begin{cases} C & x^2 + y^2 \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Find marginal pdfs. Is it a product pdf?

Need C:

$$\iint_{x^2 + y^2 \le 1} C\,dx\,dy = 1.$$

Integral = area of the unit circle multiplied by C ⇒ C = 1/π.


$$f_X(x) = \int_{-\sqrt{1-x^2}}^{+\sqrt{1-x^2}} C\,dy = 2C\sqrt{1 - x^2}, \quad x^2 \le 1.$$

Could now also find C by a second integration:

$$\int_{-1}^{+1} 2C\sqrt{1 - x^2}\,dx = \pi C = 1,$$

or C = π^{-1}.

Thus

$$f_X(x) = 2\pi^{-1}\sqrt{1 - x^2}, \quad x^2 \le 1.$$

By symmetry Y has the same pdf. f_{X,Y} not a product pdf.

Note marginal pdf is not constant, even though the joint pdf is.
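
Two claims in the unit disk example are easy to check numerically: the marginal integrates to 1, and the joint is not the product of its marginals. The sketch below uses a simple midpoint rule; the grid size and the test point (0.5, 0.5) are illustrative assumptions.

```python
import math

def marginal_pdf(x: float) -> float:
    """f_X(x) = (2 / pi) sqrt(1 - x^2) on [-1, 1], zero elsewhere."""
    return (2.0 / math.pi) * math.sqrt(1.0 - x * x) if abs(x) <= 1.0 else 0.0

def midpoint_integral(f, a: float, b: float, n: int = 100_000) -> float:
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

total_mass = midpoint_integral(marginal_pdf, -1.0, 1.0)

# Product test at one point inside the disk, where the joint pdf equals 1/pi.
x0, y0 = 0.5, 0.5
joint_value = 1.0 / math.pi
product_value = marginal_pdf(x0) * marginal_pdf(y0)
```

Here `total_mass` should be close to 1, while `joint_value` and `product_value` differ, which confirms that the joint pdf does not factor.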


Joints and marginals: Gaussian pair

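2D Gaussian pdf with k = 2, m = (0, 0)^t, and Λ = {λ(i, j)} with λ(1, 1) = λ(2, 2) = 1, λ(1, 2) = λ(2, 1) = ρ. Inverse matrix is

$$\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \frac{1}{1 - \rho^2} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix},$$

so the joint pdf for the random vector (X, Y) is

$$f_{X,Y}(x, y) = \frac{\exp\left(-\frac{1}{2(1-\rho^2)}(x^2 + y^2 - 2\rho x y)\right)}{2\pi\sqrt{1 - \rho^2}}, \quad (x, y) \in \mathbb{R}^2.$$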

ρ called “correlation coefficient”


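Need ρ^2 < 1 for Λ to be positive definite

To find the pdf of X, integrate joint over y

Do this using standard trick: complete the square:

$$x^2 + y^2 - 2\rho x y = (y - \rho x)^2 - \rho^2 x^2 + x^2 = (y - \rho x)^2 + (1 - \rho^2) x^2$$

$$f_{X,Y}(x, y) = \frac{\exp\left(-\frac{(y - \rho x)^2}{2(1-\rho^2)} - \frac{x^2}{2}\right)}{2\pi\sqrt{1 - \rho^2}} = \frac{\exp\left(-\frac{(y - \rho x)^2}{2(1-\rho^2)}\right)}{\sqrt{2\pi(1 - \rho^2)}} \cdot \frac{\exp\left(-\frac{x^2}{2}\right)}{\sqrt{2\pi}}.$$

Part of joint is N(ρx, 1 − ρ^2), which integrates to 1. Thus

$$f_X(x) = (2\pi)^{-1/2} e^{-x^2/2}.$$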

Note marginals the same regardless of ρ!
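
The claim that the marginal does not depend on ρ can be checked by integrating the joint pdf over y numerically for several values of ρ. This is a minimal sketch using a midpoint rule; the integration range, grid size, evaluation point x = 0.8, and the chosen ρ values are illustrative assumptions.

```python
import math

def joint_pdf(x: float, y: float, rho: float) -> float:
    """Zero-mean, unit-variance bivariate Gaussian pdf with correlation rho."""
    quad = (x * x + y * y - 2.0 * rho * x * y) / (2.0 * (1.0 - rho * rho))
    return math.exp(-quad) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))

def marginal_by_integration(x: float, rho: float,
                            half_width: float = 12.0, n: int = 20_000) -> float:
    """Midpoint-rule integral of the joint pdf over y in [-half_width, half_width]."""
    h = 2.0 * half_width / n
    return h * sum(joint_pdf(x, -half_width + (i + 0.5) * h, rho) for i in range(n))

def standard_normal_pdf(x: float) -> float:
    """N(0, 1) pdf, the claimed marginal."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

x = 0.8
marginal_values = {rho: marginal_by_integration(x, rho) for rho in (-0.9, 0.0, 0.5)}
```

Every entry of `marginal_values` should match `standard_normal_pdf(x)` to within numerical integration error.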


Consistency & directly given processes

Have seen two ways to describe (specify) a random variable: as a probability space + a function (random variable), or a directly given rv (a distribution: pdf or pmf)

Same idea works for random vectors.

What about random processes? E.g., direct definition of fair coin flipping process.


For simplicity, consider discrete time, discrete alphabet random process, say {X_n}. Given random process, can use inverse image formula to compute pmf for any finite collection of samples (X_{k_1}, X_{k_2}, ..., X_{k_K}), e.g.,

$$p_{X_{k_1}, X_{k_2}, \ldots, X_{k_K}}(x_1, x_2, \ldots, x_K) = \Pr(X_{k_i} = x_i;\ i = 1, \ldots, K) = P(\{\omega : X_{k_i}(\omega) = x_i;\ i = 1, \ldots, K\})$$

For example, in the fair coin flipping process

$$p_{X_{k_1}, X_{k_2}, \ldots, X_{k_K}}(x_1, x_2, \ldots, x_K) = 2^{-K}, \quad \text{all } (x_1, x_2, \ldots, x_K) \in \{0, 1\}^K$$


The axioms of probability ⇒ these pmfs for any choice of K and k_1, ..., k_K must be consistent in the sense that if any of the pmfs is used to compute the probability of an event, the answer must be the same. E.g.,

$$\begin{aligned}
p_{X_1}(x_1) &= \sum_{x_2} p_{X_1, X_2}(x_1, x_2) \\
&= \sum_{x_0, x_2} p_{X_0, X_1, X_2}(x_0, x_1, x_2) \\
&= \sum_{x_3, x_5} p_{X_1, X_3, X_5}(x_1, x_3, x_5)
\end{aligned}$$

since all of these computations yield the same probability in the original probability space Pr(X_1 = x_1) = P({ω : X_1(ω) = x_1})


Bottom line: If given a discrete time discrete alphabet random process {X_n; n ∈ Z}, then for any finite K and collection of K sample times k_1, ..., k_K can find the joint pmf p_{X_{k_1}, X_{k_2}, ..., X_{k_K}}(x_1, x_2, ..., x_K), and this collection of pmfs must be consistent.

Kolmogorov proved a converse to this idea, now called the Kolmogorov extension theorem, which provides the most common method for describing a random process:

Theorem. Kolmogorov extension theorem for discrete time processes. Given a consistent family of finite-dimensional pmfs p_{X_{k_1}, X_{k_2}, ..., X_{k_K}}(x_1, x_2, ..., x_K) for all dimensions K and sample times k_1, ..., k_K, there is a random process {X_n; n ∈ Z} described by these marginals.


To completely describe a random process, you need only provide a formula for a consistent family of pmfs for finite collections of samples.

The same result holds for continuous time random processes and for continuous alphabet processes (family of pdfs)

Difficult to prove, but most common way to specify model. Kolmogorov or directly-given representation of a random process: describe consistent family of vector distributions. For completeness:


Theorem. Kolmogorov extension theorem. Suppose that one is given a consistent family of finite-dimensional distributions P_{X_{t_0}, X_{t_1}, ..., X_{t_{k−1}}} for all positive integers k and all possible sample times t_i ∈ T; i = 0, 1, ..., k − 1. Then there exists a random process {X_t; t ∈ T} that is consistent with this family. In other words, to describe a random process completely, it is sufficient to describe a consistent family of finite-dimensional distributions of its samples.

Example: Given a pmf p, define a family of vector pmfs by

$$p_{X_{k_1}, X_{k_2}, \ldots, X_{k_K}}(x_1, x_2, \ldots, x_K) = \prod_{i=1}^{K} p(x_i),$$

then there is a random process {X_n} having these vector pmfs for finite collections of samples. A process of this form is called an iid process.


The continuous alphabet analog is defined in terms of a pdf f: define the vector pdfs by

$$f_{X_{k_1}, X_{k_2}, \ldots, X_{k_K}}(x_1, x_2, \ldots, x_K) = \prod_{i=1}^{K} f(x_i)$$

A discrete time continuous alphabet process is iid if its joint pdfs factor in this way.
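
Consistency of the iid family can be illustrated on a small alphabet: summing the 3-dimensional product pmf over its last coordinate should reproduce the 2-dimensional product pmf. The sketch below checks only this one consistency relation; the scalar pmf values are a hypothetical example.

```python
from itertools import product

def iid_vector_pmf(p: dict, K: int) -> dict:
    """Joint pmf of K iid samples: product of the scalar pmf p over coordinates."""
    joint = {}
    for xs in product(p.keys(), repeat=K):
        prob = 1.0
        for x in xs:
            prob *= p[x]
        joint[xs] = prob
    return joint

def marginalize_last(joint: dict) -> dict:
    """Sum out the last coordinate of a vector pmf."""
    out = {}
    for xs, prob in joint.items():
        out[xs[:-1]] = out.get(xs[:-1], 0.0) + prob
    return out

p = {0: 0.3, 1: 0.7}                 # hypothetical scalar pmf
pmf3 = iid_vector_pmf(p, 3)
pmf2_from_3 = marginalize_last(pmf3)
pmf2_direct = iid_vector_pmf(p, 2)
```

The dictionaries `pmf2_from_3` and `pmf2_direct` should agree entry by entry, which is the kind of agreement the extension theorem requires of every member of the family.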


Independent random variables

Return to definition of independent rvs, with more explanation.

Definition of independent random variables is an application of definition of independent events.

Defined events F and G to be independent if P(F ∩ G) = P(F)P(G)

Two random variables X and Y defined on a probability space are independent if the events X^{-1}(F) and Y^{-1}(G) are independent for all F and G in B(R), i.e., if

P(X^{-1}(F) ∩ Y^{-1}(G)) = P(X^{-1}(F)) P(Y^{-1}(G))

Equivalently, Pr(X ∈ F, Y ∈ G) = Pr(X ∈ F) Pr(Y ∈ G) or P_{XY}(F × G) = P_X(F) P_Y(G)


If X, Y discrete, choosing F = {x}, G = {y} ⇒

p_{XY}(x, y) = p_X(x) p_Y(y) all x, y

Conversely, if joint pmf = product of marginals, then evaluate Pr(X ∈ F, Y ∈ G) as

$$\begin{aligned}
P(X^{-1}(F) \cap Y^{-1}(G)) &= \sum_{x \in F, y \in G} p_{XY}(x, y) = \sum_{x \in F, y \in G} p_X(x) p_Y(y) \\
&= \sum_{x \in F} p_X(x) \sum_{y \in G} p_Y(y) \\
&= P(X^{-1}(F)) P(Y^{-1}(G))
\end{aligned}$$

⇒ independent by general definition


For general random variables, consider F = (−∞, x], G = (−∞, y]. Then if X, Y independent, F_{XY}(x, y) = F_X(x) F_Y(y) all x, y. If pdfs exist, this implies that

f_{XY}(x, y) = f_X(x) f_Y(y)

Conversely, if this relation holds for all x, y, then P(X^{-1}(F) ∩ Y^{-1}(G)) = P(X^{-1}(F)) P(Y^{-1}(G)) and hence X and Y are independent.


A collection of rvs {X_i; i = 0, 1, ..., k − 1} is independent or mutually independent if all collections of events of the form {X_i^{-1}(F_i); i = 0, 1, ..., k − 1} are mutually independent for any F_i ∈ B(R); i = 0, 1, ..., k − 1.

A collection of discrete random variables X_i; i = 0, 1, ..., k − 1 is mutually independent iff

$$p_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{k-1}) = \prod_{i=0}^{k-1} p_{X_i}(x_i); \quad \forall x_i.$$

A collection of continuous random variables is independent iff the joint pdf factors as

$$f_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{k-1}) = \prod_{i=0}^{k-1} f_{X_i}(x_i).$$


A collection of general random variables is independent iff the joint cdf factors as

$$F_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{k-1}) = \prod_{i=0}^{k-1} F_{X_i}(x_i); \quad (x_0, x_1, \ldots, x_{k-1}) \in \mathbb{R}^k.$$

The random vector is independent, identically distributed (iid) if the components are independent and the marginal distributions are all the same.
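
For two discrete random variables the factorization test is mechanical: compute both marginals and compare the joint pmf with their product at every pair. The sketch below applies this to a pair of fair dice and to the loaded dice whose sum is always 7; the dictionary representation and tolerance are illustrative assumptions.

```python
from itertools import product

def is_independent(joint: dict, tol: float = 1e-12) -> bool:
    """Check p_{XY}(x, y) = p_X(x) p_Y(y) for every (x, y) in the product alphabet."""
    p_x, p_y = {}, {}
    for (x, y), prob in joint.items():
        p_x[x] = p_x.get(x, 0.0) + prob
        p_y[y] = p_y.get(y, 0.0) + prob
    # Pairs missing from the dictionary have joint probability 0.
    return all(
        abs(joint.get((x, y), 0.0) - p_x[x] * p_y[y]) <= tol
        for x, y in product(p_x, p_y)
    )

fair_dice = {(x, y): 1.0 / 36 for x in range(1, 7) for y in range(1, 7)}
loaded_dice = {(x, 7 - x): 1.0 / 6 for x in range(1, 7)}   # sum is always 7

fair_independent = is_independent(fair_dice)
loaded_independent = is_independent(loaded_dice)
```

The fair dice pass the test and the loaded dice fail it, even though both pairs have uniform marginals.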


Conditional distributions

Apply conditional probability to distributions.

Can express joint probabilities as products even if rvs not independent

E.g., distribution of input given observed output (for inference)

There are many types: conditional pmfs, conditional pdfs, conditional cdfs

Elementary and nonelementary conditional probability


Discrete conditional distributions

Simplest, direct application of elementary conditional probability to pmfs

Consider 2D discrete random vector (X,Y)

alphabet AX × AY

joint pmf pX,Y(x, y)

marginal pmfs pX and pY


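Define for each x ∈ A_X for which p_X(x) > 0 the conditional pmf

$$\begin{aligned}
p_{Y|X}(y|x) &= P(Y = y \mid X = x) \\
&= \frac{P(Y = y, X = x)}{P(X = x)} \\
&= \frac{P(\{\omega : Y(\omega) = y\} \cap \{\omega : X(\omega) = x\})}{P(\{\omega : X(\omega) = x\})} \\
&= \frac{p_{X,Y}(x, y)}{p_X(x)},
\end{aligned}$$

the elementary conditional probability that Y = y given X = x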


Properties of conditional pmfs

For fixed x, p_{Y|X}(·|x) is a pmf:

$$\sum_{y \in A_Y} p_{Y|X}(y|x) = \sum_{y \in A_Y} \frac{p_{X,Y}(x, y)}{p_X(x)} = \frac{1}{p_X(x)} \sum_{y \in A_Y} p_{X,Y}(x, y) = \frac{1}{p_X(x)}\, p_X(x) = 1.$$

The joint pmf can be expressed as a product as

p_{X,Y}(x, y) = p_{Y|X}(y|x) p_X(x).


Can compute conditional probabilities by summing conditional pmfs,

$$P(Y \in F \mid X = x) = \sum_{y \in F} p_{Y|X}(y|x)$$

Can write probabilities of events of the form X ∈ G, Y ∈ F (rectangles) as

$$\begin{aligned}
P(X \in G, Y \in F) &= \sum_{x, y : x \in G, y \in F} p_{X,Y}(x, y) \\
&= \sum_{x \in G} p_X(x) \sum_{y \in F} p_{Y|X}(y \mid x) \\
&= \sum_{x \in G} p_X(x) P(F \mid X = x)
\end{aligned}$$

Later: define nonelementary conditional probability to mimic this formula


If X and Y are independent, then p_{Y|X}(y|x) = p_Y(y)

Given p_{Y|X} and p_X, can find the reverse conditional pmf:

$$p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} = \frac{p_{Y|X}(y|x) p_X(x)}{\sum_u p_{Y|X}(y|u) p_X(u)},$$

a result often referred to as Bayes' rule.


Example of Bayes rule: Binary Symmetric Channel

Consider the following binary communication channel

[Figure: binary channel; input X ∈ {0, 1} is added (mod 2) to noise Z ∈ {0, 1} to give output Y ∈ {0, 1}]

Bit sent is X ∼ Bern(p), 0 ≤ p ≤ 1, noise is Z ∼ Bern(ε), 0 ≤ ε ≤ 0.5, bit received is Y = (X + Z) mod 2 = X ⊕ Z, and X and Z are independent

Find 1) p_{X|Y}(x|y), 2) p_Y(y), and 3) Pr{X ≠ Y}, the probability of error


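1. To find p_{X|Y}(x|y) use Bayes rule

$$p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x)}{\sum_{u \in A_X} p_{Y|X}(y|u) p_X(u)}\, p_X(x)$$

Know p_X(x), but we need to find p_{Y|X}(y|x):

$$\begin{aligned}
p_{Y|X}(y|x) &= \Pr\{Y = y \mid X = x\} = \Pr\{X \oplus Z = y \mid X = x\} \\
&= \Pr\{x \oplus Z = y \mid X = x\} = \Pr\{Z = y \oplus x \mid X = x\} \\
&= \Pr\{Z = y \oplus x\} \quad \text{since Z and X are independent} \\
&= p_Z(y \oplus x)
\end{aligned}$$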


Therefore

p_{Y|X}(0 | 0) = p_Z(0 ⊕ 0) = p_Z(0) = 1 − ε

p_{Y|X}(0 | 1) = p_Z(0 ⊕ 1) = p_Z(1) = ε

p_{Y|X}(1 | 0) = p_Z(1 ⊕ 0) = p_Z(1) = ε

p_{Y|X}(1 | 1) = p_Z(1 ⊕ 1) = p_Z(0) = 1 − ε


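Plugging into Bayes rule:

$$p_{X|Y}(0|0) = \frac{p_{Y|X}(0|0)}{p_{Y|X}(0|0) p_X(0) + p_{Y|X}(0|1) p_X(1)}\, p_X(0) = \frac{(1-\epsilon)(1-p)}{(1-\epsilon)(1-p) + \epsilon p}$$

$$p_{X|Y}(1|0) = 1 - p_{X|Y}(0|0) = \frac{\epsilon p}{(1-\epsilon)(1-p) + \epsilon p}$$

$$p_{X|Y}(0|1) = \frac{p_{Y|X}(1|0)}{p_{Y|X}(1|0) p_X(0) + p_{Y|X}(1|1) p_X(1)}\, p_X(0) = \frac{\epsilon(1-p)}{(1-\epsilon)p + \epsilon(1-p)}$$

$$p_{X|Y}(1|1) = 1 - p_{X|Y}(0|1) = \frac{(1-\epsilon)p}{(1-\epsilon)p + \epsilon(1-p)}$$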


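2. We already found p_Y(y) as

$$p_Y(y) = p_{Y|X}(y|0) p_X(0) + p_{Y|X}(y|1) p_X(1) = \begin{cases} (1-\epsilon)(1-p) + \epsilon p & \text{for } y = 0 \\ \epsilon(1-p) + (1-\epsilon)p & \text{for } y = 1 \end{cases}$$

3. Now to find the probability of error Pr{X ≠ Y}, consider

$$\begin{aligned}
\Pr\{X \ne Y\} &= p_{X,Y}(0, 1) + p_{X,Y}(1, 0) \\
&= p_{Y|X}(1|0) p_X(0) + p_{Y|X}(0|1) p_X(1) \\
&= \epsilon(1-p) + \epsilon p = \epsilon
\end{aligned}$$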


An interesting special case is ε = 1/2. Here, Pr{X ≠ Y} = 1/2, which is the worst possible (no information is sent), and

$$p_Y(0) = \tfrac{1}{2} p + \tfrac{1}{2}(1 - p) = \tfrac{1}{2} = p_Y(1)$$

Therefore Y ∼ Bern(1/2), independent of the value of p!

In this case, the bit sent X and the bit received Y are independent (check this)
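
The three parts of the binary symmetric channel example can be reproduced numerically from the joint pmf. This is a minimal sketch; the function names and the parameter values p = 0.3, ε = 0.1 are illustrative assumptions.

```python
def bsc_joint(p: float, eps: float) -> dict:
    """Joint pmf p_{X,Y}(x, y) = p_X(x) p_Z(y XOR x) for X ~ Bern(p), Z ~ Bern(eps)."""
    p_x = {0: 1.0 - p, 1: p}
    p_z = {0: 1.0 - eps, 1: eps}
    return {(x, y): p_x[x] * p_z[y ^ x] for x in (0, 1) for y in (0, 1)}

def posterior(joint: dict) -> dict:
    """Bayes rule: p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y), keyed by (x, y)."""
    p_y = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
    return {(x, y): joint[(x, y)] / p_y[y] for (x, y) in joint}

def error_probability(joint: dict) -> float:
    """Pr{X != Y} = p_{X,Y}(0, 1) + p_{X,Y}(1, 0)."""
    return joint[(0, 1)] + joint[(1, 0)]

p, eps = 0.3, 0.1                     # hypothetical parameter values
joint = bsc_joint(p, eps)
post = posterior(joint)
p_error = error_probability(joint)

# Special case eps = 1/2: the posterior should equal the prior for every y.
post_half = posterior(bsc_joint(p, 0.5))
```

With these values `p_error` equals ε, and in the ε = 1/2 case `post_half[(1, y)]` equals p for both values of y, which is the independence the notes ask the reader to check.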


Conditional pmfs for vectors

Random vector (X_0, X_1, ..., X_{k−1})

pmf p_{X_0, X_1, ..., X_{k−1}}

Define conditional pmfs (assuming denominators ≠ 0)

$$p_{X_l | X_0, \ldots, X_{l-1}}(x_l \mid x_0, \ldots, x_{l-1}) = \frac{p_{X_0, \ldots, X_l}(x_0, \ldots, x_l)}{p_{X_0, \ldots, X_{l-1}}(x_0, \ldots, x_{l-1})}.$$


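⇒ chain rule

$$\begin{aligned}
p_{X_0, X_1, \ldots, X_{n-1}}(x_0, x_1, \ldots, x_{n-1}) &= \frac{p_{X_0, X_1, \ldots, X_{n-1}}(x_0, x_1, \ldots, x_{n-1})}{p_{X_0, X_1, \ldots, X_{n-2}}(x_0, x_1, \ldots, x_{n-2})}\, p_{X_0, X_1, \ldots, X_{n-2}}(x_0, x_1, \ldots, x_{n-2}) \\
&\;\;\vdots \\
&= p_{X_0}(x_0) \prod_{i=1}^{n-1} \frac{p_{X_0, X_1, \ldots, X_i}(x_0, x_1, \ldots, x_i)}{p_{X_0, X_1, \ldots, X_{i-1}}(x_0, x_1, \ldots, x_{i-1})} \\
&= p_{X_0}(x_0) \prod_{l=1}^{n-1} p_{X_l | X_0, \ldots, X_{l-1}}(x_l \mid x_0, \ldots, x_{l-1})
\end{aligned}$$

Formula plays an important role in characterizing memory in processes. Can be used to construct joint pmfs, and to specify a random process.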
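
To illustrate how the chain rule can construct a joint pmf, the sketch below builds one from an initial pmf and conditional pmfs. It makes a simplifying assumption that is not in the notes: each conditional pmf depends only on the immediately preceding sample. The transition probabilities are hypothetical values.

```python
from itertools import product

def chain_rule_joint(p0: dict, transition: dict, n: int) -> dict:
    """Build a joint pmf of (X_0, ..., X_{n-1}) as p0(x_0) * prod_l p(x_l | x_{l-1}).

    Assumption for this sketch: the conditional pmf of X_l given the whole past
    depends only on X_{l-1}, so transition[(prev, cur)] stands in for
    p_{X_l | X_0, ..., X_{l-1}}.
    """
    joint = {}
    for xs in product(p0.keys(), repeat=n):
        prob = p0[xs[0]]
        for prev, cur in zip(xs, xs[1:]):
            prob *= transition[(prev, cur)]
        joint[xs] = prob
    return joint

p0 = {0: 0.5, 1: 0.5}                                    # hypothetical initial pmf
transition = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}
joint = chain_rule_joint(p0, transition, n=4)
total = sum(joint.values())
```

Because every conditional pmf sums to 1 over its first argument, the constructed joint automatically sums to 1.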


Continuous conditional distributions

Continuous distributions more complicated

Given X, Y with joint pdf f_{X,Y}, marginal pdfs f_X, f_Y, define conditional pdf

$$f_{Y|X}(y|x) \equiv \frac{f_{X,Y}(x, y)}{f_X(x)}.$$

Analogous to conditional pmf, but unlike conditional pmf, not a conditional probability!

A density of conditional probability

Problem: conditioning event has probability 0. Elementary conditional probability does not work.


Conditional pdf is a pdf:

$$\int f_{Y|X}(y|x)\,dy = \int \frac{f_{X,Y}(x, y)}{f_X(x)}\,dy = \frac{1}{f_X(x)} \int f_{X,Y}(x, y)\,dy = \frac{1}{f_X(x)}\, f_X(x) = 1,$$

provided we require that f_X(x) > 0 over the region of integration.

Given a conditional pdf f_{Y|X}, define (nonelementary) conditional probability that Y ∈ F given X = x by

$$P(Y \in F \mid X = x) \equiv \int_F f_{Y|X}(y|x)\,dy.$$

Resembles discrete form.


Nonelementary conditional probability

Does

$$P(Y \in F \mid X = x) = \int_F f_{Y|X}(y|x)\,dy$$

make sense as an appropriate definition of conditional probability given an event of zero probability?

Observe that, analogous to the corresponding result for pmfs, assuming the pdfs all make sense

$$\begin{aligned}
P(X \in G, Y \in F) &= \iint_{x, y : x \in G, y \in F} f_{X,Y}(x, y)\,dx\,dy \\
&= \int_{x \in G} f_X(x) \left( \int_{y \in F} f_{Y|X}(y \mid x)\,dy \right) dx \\
&= \int_{x \in G} f_X(x) P(F \mid X = x)\,dx
\end{aligned}$$


Our definition is ad hoc. But the careful mathematical definition of conditional probability P(F | X = x) for an event of 0 probability is made not by a formula such as we have used to define conditional pmfs and pdfs and elementary conditional probability, but by its behavior inside an integral (like the Dirac delta). In particular, P(F | X = x) is defined as any measurable function satisfying this equation for all events F and G, which our definition does.


Bayes rule for pdfs

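Bayes rule:

$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{f_{Y|X}(y|x) f_X(x)}{\int f_{Y|X}(y|u) f_X(u)\,du}.$$

Example of conditional pdfs: 2D Gaussian

U = (X, Y), Gaussian pdf with mean (m_X, m_Y)^t and covariance matrix

$$\Lambda = \begin{bmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{bmatrix},$$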


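Algebra ⇒

$$\det(\Lambda) = \sigma_X^2 \sigma_Y^2 (1 - \rho^2)$$

$$\Lambda^{-1} = \frac{1}{1 - \rho^2} \begin{bmatrix} 1/\sigma_X^2 & -\rho/(\sigma_X\sigma_Y) \\ -\rho/(\sigma_X\sigma_Y) & 1/\sigma_Y^2 \end{bmatrix}$$

so

$$\begin{aligned}
f_{XY}(x, y) &= \frac{1}{2\pi\sqrt{\det\Lambda}}\, e^{-\frac{1}{2}(x - m_X,\, y - m_Y)\Lambda^{-1}(x - m_X,\, y - m_Y)^t} \\
&= \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2(1 - \rho^2)} \left[\left(\frac{x - m_X}{\sigma_X}\right)^2 - 2\rho\frac{(x - m_X)(y - m_Y)}{\sigma_X\sigma_Y} + \left(\frac{y - m_Y}{\sigma_Y}\right)^2\right]\right\}
\end{aligned}$$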


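Rearrange

$$f_{XY}(x, y) = \frac{\exp\left(-\frac{1}{2}\left(\frac{x - m_X}{\sigma_X}\right)^2\right)}{\sqrt{2\pi\sigma_X^2}} \cdot \frac{\exp\left(-\frac{1}{2}\left(\frac{y - m_Y - (\rho\sigma_Y/\sigma_X)(x - m_X)}{\sqrt{1 - \rho^2}\,\sigma_Y}\right)^2\right)}{\sqrt{2\pi\sigma_Y^2(1 - \rho^2)}}$$

⇒

$$f_{Y|X}(y|x) = \frac{\exp\left(-\frac{1}{2}\left(\frac{y - m_Y - (\rho\sigma_Y/\sigma_X)(x - m_X)}{\sqrt{1 - \rho^2}\,\sigma_Y}\right)^2\right)}{\sqrt{2\pi\sigma_Y^2(1 - \rho^2)}},$$

Gaussian with variance σ^2_{Y|X} ≡ σ_Y^2 (1 − ρ^2), mean m_{Y|X} ≡ m_Y + ρ(σ_Y/σ_X)(x − m_X)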


Integrate joint over y (as before) ⇒

$$f_X(x) = \frac{e^{-(x - m_X)^2 / (2\sigma_X^2)}}{\sqrt{2\pi\sigma_X^2}}.$$

Similarly, f_Y(y) and f_{X|Y}(x|y) are also Gaussian

Note: X and Y jointly Gaussian ⇒ also both individually and conditionally Gaussian!
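
The conditional mean and variance formulas can be verified pointwise by dividing the joint pdf by the marginal and comparing with the stated Gaussian. This is a minimal sketch; all parameter values and the test point are hypothetical.

```python
import math

def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Scalar Gaussian pdf with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def joint_pdf(x, y, m_x, m_y, s_x, s_y, rho):
    """Bivariate Gaussian pdf with means m_x, m_y, std devs s_x, s_y, correlation rho."""
    zx, zy = (x - m_x) / s_x, (y - m_y) / s_y
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (2.0 * (1.0 - rho * rho))
    return math.exp(-quad) / (2.0 * math.pi * s_x * s_y * math.sqrt(1.0 - rho * rho))

def conditional_pdf(y, x, m_x, m_y, s_x, s_y, rho):
    """f_{Y|X}(y|x): mean m_y + rho (s_y / s_x)(x - m_x), variance s_y^2 (1 - rho^2)."""
    mean = m_y + rho * (s_y / s_x) * (x - m_x)
    return gaussian_pdf(y, mean, s_y ** 2 * (1.0 - rho * rho))

params = dict(m_x=1.0, m_y=-2.0, s_x=2.0, s_y=0.5, rho=0.6)   # hypothetical values
x, y = 0.3, -1.7
ratio = joint_pdf(x, y, **params) / gaussian_pdf(x, params["m_x"], params["s_x"] ** 2)
closed_form = conditional_pdf(y, x, **params)
```

The value `ratio` is the definition f_{X,Y}/f_X and `closed_form` is the Gaussian with the conditional mean and variance; the two should coincide.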


Chain rule for pdfs

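Assume f_{X_0, X_1, ..., X_i}(x_0, x_1, ..., x_i) > 0,

$$\begin{aligned}
f_{X_0, X_1, \ldots, X_{n-1}}(x_0, x_1, \ldots, x_{n-1}) &= \frac{f_{X_0, X_1, \ldots, X_{n-1}}(x_0, x_1, \ldots, x_{n-1})}{f_{X_0, X_1, \ldots, X_{n-2}}(x_0, x_1, \ldots, x_{n-2})}\, f_{X_0, X_1, \ldots, X_{n-2}}(x_0, x_1, \ldots, x_{n-2}) \\
&\;\;\vdots \\
&= f_{X_0}(x_0) \prod_{i=1}^{n-1} \frac{f_{X_0, X_1, \ldots, X_i}(x_0, x_1, \ldots, x_i)}{f_{X_0, X_1, \ldots, X_{i-1}}(x_0, x_1, \ldots, x_{i-1})} \\
&= f_{X_0}(x_0) \prod_{i=1}^{n-1} f_{X_i | X_0, \ldots, X_{i-1}}(x_i \mid x_0, \ldots, x_{i-1}).
\end{aligned}$$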


Statistical detection and classification

Simple application of conditional probability mass functions describing discrete random vectors

Transmitted: discrete rv X, pmf pX, pX(1) = p

(e.g., one sample of a binary random process)

Received: rv Y

Conditional pmf (noisy channel) pY |X(y|x)

More specific example as special case: X Bernoulli, parameter p

$$p_{Y|X}(y|x) = \begin{cases} \epsilon & x \ne y \\ 1 - \epsilon & x = y \end{cases}.$$


binary symmetric channel (BSC)

Given observation Y, what is the best guess X̂(Y) of transmitted value?

decision rule or detection rule

Measure quality by probability guess is correct:

P_c(X̂) = Pr(X = X̂(Y)) = 1 − P_e,

where P_e(X̂) = Pr(X̂(Y) ≠ X).

A decision rule is optimal if it yields the smallest possible P_e or maximum possible P_c


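$$\begin{aligned}
\Pr(\hat{X} = X) = 1 - P_e(\hat{X}) &= \sum_{(x,y):\hat{X}(y)=x} p_{X,Y}(x,y) \\
&= \sum_{(x,y):\hat{X}(y)=x} p_{X|Y}(x|y)\, p_Y(y) \\
&= \sum_y p_Y(y) \sum_{x:\hat{X}(y)=x} p_{X|Y}(x|y) \\
&= \sum_y p_Y(y)\, p_{X|Y}(\hat{X}(y)|y).
\end{aligned}$$

To maximize sum, maximize $p_{X|Y}(\hat{X}(y)|y)$ for each y.

Accomplished by $\hat{X}(y) \equiv \arg\max_u p_{X|Y}(u|y)$ which yields

$$p_{X|Y}(\hat{X}(y)|y) = \max_u p_{X|Y}(u|y)$$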


This is the maximum a posteriori (MAP) detection rule

In binary example: Choose $\hat{X}(y) = y$ if $\epsilon < 1/2$ and $\hat{X}(y) = 1 - y$ if $\epsilon > 1/2$.

⇒ minimum (optimal) error probability over all possible rules is $\min(\epsilon, 1-\epsilon)$

In general nonbinary case, statistical detection is statistical classification: Unseen X might be presence or absence of a disease, observation Y the results of various tests.

General Bayesian classification allows weighting of cost of different kinds of errors (Bayes risk) so minimize a weighted average (expected cost) instead of only probability of error
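
A MAP detector for a Bernoulli input sent over a binary symmetric channel can be sketched in a few lines. This is an illustrative sketch only: the names `map_detector`, `error_probability`, and `eps` (the channel crossover probability) are assumptions introduced here, not notation from the notes.

```python
def channel(y, x, eps):
    """BSC conditional pmf p_{Y|X}(y|x): flip with probability eps."""
    return eps if y != x else 1.0 - eps

def map_detector(y, p, eps):
    """Return argmax_x p_{X|Y}(x|y) for X ~ Bernoulli(p) observed through a BSC."""
    prior = {0: 1.0 - p, 1: p}
    # p_Y(y) is common to both candidates, so comparing numerators suffices.
    scores = {x: channel(y, x, eps) * prior[x] for x in (0, 1)}
    return max(scores, key=scores.get)

def error_probability(p, eps):
    """P_e = 1 - sum_y max_x p_{Y|X}(y|x) p_X(x), the error of the MAP rule."""
    prior = {0: 1.0 - p, 1: p}
    p_correct = sum(
        max(channel(y, x, eps) * prior[x] for x in (0, 1)) for y in (0, 1)
    )
    return 1.0 - p_correct

print(map_detector(1, 0.5, 0.1), error_probability(0.5, 0.1))
```

With an equiprobable input the rule reduces to "guess y" when the crossover probability is below 1/2 and "guess 1 − y" when it is above, matching the binary example.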


Additive noise: Discrete random variables

Common setup in communications, signal processing, statistics:

Original signal X has random noise W (independent of X) added to it, observe Y = X + W

Typically use observation Y to make inference about X

Begin by deriving conditional distributions.

Discrete case: Have independent rvs X and W with pmfs $p_X$ and $p_W$. Form Y = X + W. Find $p_Y$

Use inverse image formula:


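$$\begin{aligned}
p_{X,Y}(x,y) &= \Pr(X = x, Y = y) = \Pr(X = x, X + W = y) \\
&= \sum_{\alpha,\beta:\,\alpha=x,\,\alpha+\beta=y} p_{X,W}(\alpha,\beta) = p_{X,W}(x, y-x) \\
&= p_X(x)\, p_W(y-x).
\end{aligned}$$

Note: Formula only makes sense if y − x is in the range space of W

Thus
$$p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)} = p_W(y-x),$$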

Intuitive!

Marginal for Y:

$$p_Y(y) = \sum_x p_{X,Y}(x,y) = \sum_x p_X(x)\, p_W(y-x)$$

a discrete convolution

Above uses ordinary real arithmetic. Similar results hold for other definitions of addition, e.g., modulo 2 arithmetic for binary

As with linear systems, convolutions can usually be easily evaluated in the transform domain. Will do shortly.
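
The discrete convolution for the pmf of a sum of independent rvs can be sketched as follows. The dictionary representation and the name `convolve_pmfs` are assumptions made for illustration; the two-dice example is not from the notes.

```python
from collections import defaultdict

def convolve_pmfs(p_x, p_w):
    """pmf of Y = X + W for independent X, W given as {value: probability} dicts."""
    p_y = defaultdict(float)
    for x, px in p_x.items():
        for w, pw in p_w.items():
            # Each pair (x, w) with x + w = y contributes p_X(x) p_W(w) to p_Y(y).
            p_y[x + w] += px * pw
    return dict(p_y)

# Illustrative example: sum of two fair six-sided dice.
die = {k: 1.0 / 6.0 for k in range(1, 7)}
two_dice = convolve_pmfs(die, die)
print(two_dice[7])
```

Looping over pairs (x, w) rather than over (x, y − x) avoids having to check whether y − x lies in the range space of W.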


Additive noise: continuous random variables

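X, W, $f_{XW}(x,w) = f_X(x) f_W(w)$ (independent), Y = X + W

Find $f_{Y|X}$ and $f_Y$

Since continuous, find joint pdf by first finding joint cdf

$$\begin{aligned}
F_{X,Y}(x,y) &= \Pr(X \le x, Y \le y) = \Pr(X \le x, X + W \le y) \\
&= \iint_{\alpha,\beta:\,\alpha\le x,\,\alpha+\beta\le y} f_{X,W}(\alpha,\beta)\, d\alpha\, d\beta \\
&= \int_{-\infty}^{x} d\alpha \int_{-\infty}^{y-\alpha} d\beta\, f_X(\alpha) f_W(\beta) \\
&= \int_{-\infty}^{x} d\alpha\, f_X(\alpha) F_W(y-\alpha).
\end{aligned}$$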


Taking derivatives:

$$f_{X,Y}(x,y) = f_X(x) f_W(y-x)$$

$$\Rightarrow f_{Y|X}(y|x) = f_W(y-x).$$

$$\Rightarrow f_Y(y) = \int f_{X,Y}(x,y)\, dx = \int f_X(x) f_W(y-x)\, dx,$$

a convolution integral of the pdfs $f_X$ and $f_W$

pdf $f_{X|Y}$ follows from Bayes' rule:

$$f_{X|Y}(x|y) = \frac{f_X(x) f_W(y-x)}{\int f_X(\alpha) f_W(y-\alpha)\, d\alpha}.$$

Gaussian example:


Additive Gaussian noise

Assume $f_X = N(0,\sigma_X^2)$, $f_W = N(0,\sigma_W^2)$, $f_{XW}(x,w) = f_X(x) f_W(w)$,

Y = X + W.

$$f_{Y|X}(y|x) = f_W(y-x) = \frac{e^{-(y-x)^2/2\sigma_W^2}}{\sqrt{2\pi\sigma_W^2}}$$

which is $N(x,\sigma_W^2)$.


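To find $f_{X|Y}$ using Bayes' rule, need $f_Y$:

$$\begin{aligned}
f_Y(y) &= \int_{-\infty}^{\infty} f_{Y|X}(y|\alpha) f_X(\alpha)\, d\alpha \\
&= \int_{-\infty}^{\infty} \frac{\exp\left[-\frac{1}{2\sigma_W^2}(y-\alpha)^2\right]}{\sqrt{2\pi\sigma_W^2}}\, \frac{\exp\left[-\frac{1}{2\sigma_X^2}\alpha^2\right]}{\sqrt{2\pi\sigma_X^2}}\, d\alpha \\
&= \frac{1}{2\pi\sigma_X\sigma_W} \int_{-\infty}^{\infty} \exp\left\{-\frac12\left[\frac{y^2 - 2\alpha y + \alpha^2}{\sigma_W^2} + \frac{\alpha^2}{\sigma_X^2}\right]\right\} d\alpha \\
&= \frac{\exp\left[-\frac{y^2}{2\sigma_W^2}\right]}{2\pi\sigma_X\sigma_W} \int_{-\infty}^{\infty} \exp\left\{-\frac12\left[\alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) - \frac{2\alpha y}{\sigma_W^2}\right]\right\} d\alpha
\end{aligned}$$

Can integrate by completing the square (later see an easier way using transforms, but this trick is not difficult)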


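Integrand resembles

$$\exp\left[-\frac12\left(\frac{\alpha-m}{\sigma}\right)^2\right],$$

which has integral

$$\int_{-\infty}^{\infty} \exp\left[-\frac12\left(\frac{\alpha-m}{\sigma}\right)^2\right] d\alpha = \sqrt{2\pi\sigma^2}$$

(Gaussian pdf integrates to 1)

Compare

$$-\frac12\underbrace{\left[\alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) - \frac{2\alpha y}{\sigma_W^2}\right]} \quad\text{vs.}\quad -\frac12\left(\frac{\alpha-m}{\sigma}\right)^2 = -\frac12\left[\underbrace{\frac{\alpha^2}{\sigma^2} - \frac{2\alpha m}{\sigma^2}} + \frac{m^2}{\sigma^2}\right].$$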


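The braced terms will be the same if choose

$$\frac{1}{\sigma^2} = \frac{1}{\sigma_W^2} + \frac{1}{\sigma_X^2} \;\Rightarrow\; \sigma^2 = \frac{\sigma_X^2\sigma_W^2}{\sigma_X^2 + \sigma_W^2},$$

and

$$\frac{y}{\sigma_W^2} = \frac{m}{\sigma^2} \;\Rightarrow\; m = \frac{\sigma^2}{\sigma_W^2}\, y.$$

⇒

$$\alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) - \frac{2\alpha y}{\sigma_W^2} = \left(\frac{\alpha-m}{\sigma}\right)^2 - \frac{m^2}{\sigma^2}$$

"completing the square."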


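⇒

$$\begin{aligned}
\int_{-\infty}^{\infty} \exp\left\{-\frac12\left[\alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) - \frac{2\alpha y}{\sigma_W^2}\right]\right\} d\alpha
&= \int_{-\infty}^{\infty} \exp\left\{-\frac12\left[\left(\frac{\alpha-m}{\sigma}\right)^2 - \frac{m^2}{\sigma^2}\right]\right\} d\alpha \\
&= \sqrt{2\pi\sigma^2}\, \exp\left[\frac{m^2}{2\sigma^2}\right]
\end{aligned}$$

⇒

$$f_Y(y) = \frac{\exp\left[-\frac12\frac{y^2}{\sigma_W^2}\right]}{2\pi\sigma_X\sigma_W}\, \sqrt{2\pi\sigma^2}\, \exp\left[\frac{m^2}{2\sigma^2}\right] = \frac{\exp\left[-\frac12\frac{y^2}{\sigma_X^2 + \sigma_W^2}\right]}{\sqrt{2\pi(\sigma_X^2 + \sigma_W^2)}}$$

So $f_Y = N(0, \sigma_X^2 + \sigma_W^2)$

Sum of two independent 0 mean Gaussian rvs is another 0 mean Gaussian rv, the variance of the sum = sum of the variances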
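
The result of completing the square can be checked by evaluating the convolution integral numerically. The sketch below uses a plain trapezoidal rule; the integration limits, step count, and variance values are illustrative assumptions.

```python
import math

def gauss_pdf(x, var):
    """Zero-mean Gaussian pdf with variance var."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def f_y_numeric(y, var_x, var_w, half_width=12.0, steps=4000):
    """Trapezoidal approximation of the integral of f_X(a) f_W(y - a) over a."""
    a, b = -half_width, half_width
    h = (b - a) / steps
    total = 0.0
    for i in range(steps + 1):
        alpha = a + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * gauss_pdf(alpha, var_x) * gauss_pdf(y - alpha, var_w)
    return total * h

# Illustrative values (assumptions, not from the notes).
var_x, var_w, y = 2.0, 0.5, 1.3
numeric = f_y_numeric(y, var_x, var_w)
closed_form = gauss_pdf(y, var_x + var_w)
print(numeric, closed_form)
```

The truncation to a finite interval is harmless here because the integrand is negligible far from the origin for these variances; other parameter choices may need wider limits.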


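For a posteriori probability $f_{X|Y}$ use Bayes' rule + algebra

$$\begin{aligned}
f_{X|Y}(x|y) &= f_{Y|X}(y|x) f_X(x) / f_Y(y) \\
&= \frac{\exp\left[-\frac{1}{2\sigma_W^2}(y-x)^2\right]}{\sqrt{2\pi\sigma_W^2}}\, \frac{\exp\left[-\frac{1}{2\sigma_X^2}x^2\right]}{\sqrt{2\pi\sigma_X^2}} \Bigg/ \frac{\exp\left[-\frac12\frac{y^2}{\sigma_X^2 + \sigma_W^2}\right]}{\sqrt{2\pi(\sigma_X^2 + \sigma_W^2)}} \\
&= \frac{\exp\left\{-\frac12\left[\frac{y^2 - 2yx + x^2}{\sigma_W^2} + \frac{x^2}{\sigma_X^2} - \frac{y^2}{\sigma_X^2 + \sigma_W^2}\right]\right\}}{\sqrt{2\pi\sigma_X^2\sigma_W^2/(\sigma_X^2 + \sigma_W^2)}} \\
&= \frac{\exp\left\{-\frac{1}{2\sigma_X^2\sigma_W^2/(\sigma_X^2 + \sigma_W^2)}\left(x - y\sigma_X^2/(\sigma_X^2 + \sigma_W^2)\right)^2\right\}}{\sqrt{2\pi\sigma_X^2\sigma_W^2/(\sigma_X^2 + \sigma_W^2)}}.
\end{aligned}$$

$$f_{X|Y}(x|y) = N\left(\frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\, y,\; \frac{\sigma_X^2\sigma_W^2}{\sigma_X^2 + \sigma_W^2}\right).$$

The mean of a conditional distribution is called a conditional mean, the variance of a conditional distribution is called a conditional variance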


Continuous additive noise with discrete input

Most important case of mixed distributions in communications applications

Typical: Binary random variable X, Gaussian random variable W, X and W independent, Y = X + W

Previous examples do not work, one rv discrete, other continuous

Similar signal processing issue: Observe Y, guess X

As before, may be one sample of a random process, in practice have $X_n$, $W_n$, $Y_n$. At time n, observe $Y_n$, guess $X_n$

Conditional cdf $F_{Y|X}(y|x)$ for Y given X = x is an elementary conditional probability. Analogous to purely discrete and purely continuous cases

FY |X(y|x) = Pr(Y ≤ y | X = x) = Pr(X +W ≤ y | X = x)

= Pr(x +W ≤ y | X = x) = Pr(W ≤ y − x | X = x)

= Pr(W ≤ y − x) = FW(y − x)

Differentiating,

$$f_{Y|X}(y|x) = \frac{d}{dy} F_{Y|X}(y|x) = \frac{d}{dy} F_W(y-x) = f_W(y-x)$$


Joint distribution described by a combination of pmf and pdf.

$$\Pr(X \in F \text{ and } Y \in G) = \sum_{x \in F} p_X(x) \int_G f_{Y|X}(y|x)\, dy = \sum_{x \in F} p_X(x) \int_G f_W(y-x)\, dy.$$

Choosing F = R yields

$$\Pr(Y \in G) = \sum_x p_X(x) \int_G f_{Y|X}(y|x)\, dy = \sum_x p_X(x) \int_G f_W(y-x)\, dy.$$

Choosing G = (−∞, y] yields cdf $F_Y(y)$ ⇒

$$f_Y(y) = \sum_x p_X(x) f_{Y|X}(y|x) = \sum_x p_X(x) f_W(y-x),$$

a convolution, analogous to pure discrete and pure continuous cases

Continuing analogy, Bayes' rule suggests conditional pmf:

$$p_{X|Y}(x|y) = \frac{f_{Y|X}(y|x)\, p_X(x)}{f_Y(y)} = \frac{f_{Y|X}(y|x)\, p_X(x)}{\sum_\alpha p_X(\alpha) f_{Y|X}(y|\alpha)},$$

but this is not an elementary conditional probability, conditioning event has probability 0!

Can be justified in similar way to conditional pdfs:

$$\Pr(X \in F \text{ and } Y \in G) = \int_G dy\, f_Y(y) \Pr(X \in F | Y = y) = \int_G dy\, f_Y(y) \sum_{x \in F} p_{X|Y}(x|y)$$

so that $p_{X|Y}(x|y)$ satisfies

$$\Pr(X \in F | Y = y) = \sum_{x \in F} p_{X|Y}(x|y)$$

Apply to binary input and Gaussian noise: the conditional pmf of the binary input given the noisy observation is

$$p_{X|Y}(x|y) = \frac{f_W(y-x)\, p_X(x)}{f_Y(y)} = \frac{f_W(y-x)\, p_X(x)}{\sum_\alpha p_X(\alpha) f_W(y-\alpha)}; \quad y \in \mathbb{R},\; x \in \{0, 1\}.$$

Can now solve classical binary detection in Gaussian noise.


Binary detection in Gaussian noise

The derivation of the MAP detector or classifier extends immediately to a binary input random variable and independent Gaussian noise

As in the purely discrete case, MAP detector $\hat{X}(y)$ of X given Y = y is given by

$$\hat{X}(y) = \arg\max_x p_{X|Y}(x|y) = \arg\max_x \frac{f_W(y-x)\, p_X(x)}{\sum_\alpha p_X(\alpha) f_W(y-\alpha)}.$$

Denominator of the conditional pmf does not depend on x, so the denominator has no effect on the maximization

$$\hat{X}(y) = \arg\max_x p_{X|Y}(x|y) = \arg\max_x f_W(y-x)\, p_X(x).$$


Assume for simplicity that X is equally likely to be 0 or 1:

$$\begin{aligned}
\hat{X}(y) = \arg\max_x p_{X|Y}(x|y) &= \arg\max_x \frac{1}{\sqrt{2\pi\sigma_W^2}} \exp\left[-\frac12\frac{(x-y)^2}{\sigma_W^2}\right] \\
&= \arg\min_x |x - y|
\end{aligned}$$

Minimum distance or nearest neighbor decision, choose closest x to y

$$\hat{X}(y) = \begin{cases} 0 & y < 0.5 \\ 1 & y > 0.5 \end{cases}.$$

A threshold detector


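Error probability of optimal detector:

$$\begin{aligned}
P_e &= \Pr(\hat{X}(Y) \ne X) \\
&= \Pr(\hat{X}(Y) \ne 0 | X = 0)\, p_X(0) + \Pr(\hat{X}(Y) \ne 1 | X = 1)\, p_X(1) \\
&= \Pr(Y > 0.5 | X = 0)\, p_X(0) + \Pr(Y < 0.5 | X = 1)\, p_X(1) \\
&= \Pr(W + X > 0.5 | X = 0)\, p_X(0) + \Pr(W + X < 0.5 | X = 1)\, p_X(1) \\
&= \Pr(W > 0.5 | X = 0)\, p_X(0) + \Pr(W + 1 < 0.5 | X = 1)\, p_X(1) \\
&= \Pr(W > 0.5)\, p_X(0) + \Pr(W < -0.5)\, p_X(1)
\end{aligned}$$

using the independence of W and X. In terms of Φ function:

$$P_e = \frac12\left[1 - \Phi\left(\frac{0.5}{\sigma_W}\right) + \Phi\left(-\frac{0.5}{\sigma_W}\right)\right] = \Phi\left(-\frac{1}{2\sigma_W}\right).$$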
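
The closed-form error probability of the threshold detector can be compared against a simulation. The sketch below assumes an equiprobable binary input and a noise standard deviation `sigma_w` chosen for illustration; the function names are hypothetical, and Φ is computed from the standard library error function.

```python
import math
import random

def Phi(x):
    """Standard normal cdf expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pe_formula(sigma_w):
    """P_e = Phi(-1 / (2 sigma_W)) for equiprobable inputs 0 and 1."""
    return Phi(-1.0 / (2.0 * sigma_w))

def pe_simulated(sigma_w, trials=100_000, seed=0):
    """Monte Carlo estimate of the error rate of the threshold-at-0.5 detector."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        x = rng.randint(0, 1)
        y = x + rng.gauss(0.0, sigma_w)
        x_hat = 1 if y > 0.5 else 0
        errors += (x_hat != x)
    return errors / trials

sigma_w = 0.5  # illustrative value
print(pe_formula(sigma_w), pe_simulated(sigma_w))
```

The simulated rate fluctuates around the formula with a standard error on the order of the square root of P_e(1 − P_e)/trials.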


Statistical estimation

In detection/classification problems, goal is to guess which of a discrete set of possibilities is true. MAP rule is an intuitive solution.

Different if (X, Y) continuous, observe Y, and guess X.

Examples: X, W independent Gaussian, Y = X + W. What is best guess of X given Y?

$X_n$ is a continuous alphabet random process (perhaps Gaussian). Observe $X_{n-1}$. What is best guess for $X_n$? What if observe $X_0, X_1, X_2, \ldots, X_{n-1}$?

Quality criterion for discrete case no longer works, $\Pr(\hat{X}(Y) = X) = 0$ in general.


Will later introduce another quality measure (MSE) and optimize.

Now mention other approaches.

Examples of estimation or regression instead of detection


MAP Estimation

Mimic MAP detection, maximize conditional probability function

$$\hat{X}_{MAP}(y) = \arg\max_x f_{X|Y}(x|y)$$

Easy to describe, application of conditional pdfs + Bayes.

But cannot argue "optimal" in sense of maximizing quality

Example: Gaussian signal plus noise

Found $f_{X|Y}(x|y)$ = Gaussian with mean $y\sigma_X^2/(\sigma_X^2 + \sigma_W^2)$

Gaussian pdf maximized at its mean ⇒ MAP estimate of X given Y = y is the conditional mean $y\sigma_X^2/(\sigma_X^2 + \sigma_W^2)$.


Maximum Likelihood Estimation

The maximum likelihood (ML) estimate of X given Y = the value of x that maximizes the conditional pdf $f_{Y|X}(y|x)$ (instead of the a posteriori pdf $f_{X|Y}(x|y)$)

$$\hat{X}_{ML}(y) = \arg\max_x f_{Y|X}(y|x).$$

Advantage: Do not need to know prior $f_X$ and use Bayes to find $f_{X|Y}(x|y)$. Simple

In the Gaussian case, $\hat{X}_{ML}(y) = y$.

Will return to estimation when consider expectations in more detail.
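
For the Gaussian signal plus noise example, the MAP and ML estimates can be compared by brute-force maximization over a grid. This is a rough sketch: the grid, the variances, and the observed value are illustrative assumptions, and normalizing constants are dropped because they do not affect the argmax.

```python
import math

def posterior_unnormalized(x, y, var_x, var_w):
    """Proportional to f_{Y|X}(y|x) f_X(x) for zero-mean Gaussian X and W."""
    return math.exp(-(y - x) ** 2 / (2.0 * var_w) - x * x / (2.0 * var_x))

def likelihood_unnormalized(x, y, var_w):
    """Proportional to f_{Y|X}(y|x) = f_W(y - x)."""
    return math.exp(-(y - x) ** 2 / (2.0 * var_w))

y, var_x, var_w = 2.0, 1.0, 3.0               # illustrative values
grid = [i / 1000.0 for i in range(-5000, 5001)]

x_map_grid = max(grid, key=lambda x: posterior_unnormalized(x, y, var_x, var_w))
x_ml_grid = max(grid, key=lambda x: likelihood_unnormalized(x, y, var_w))
x_map_formula = y * var_x / (var_x + var_w)   # conditional mean
print(x_map_grid, x_map_formula, x_ml_grid)
```

The MAP estimate shrinks the observation toward the prior mean 0, while the ML estimate ignores the prior and returns y itself.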


Characteristic functions

When sum independent random variables, find derived distribution by convolution of pmfs or pdfs

Can be complicated, avoidable using transforms as in linear systems

Summing independent random variables arises frequently in signal analysis problems. E.g., iid random process $X_k$ is put into a linear filter to produce an output $Y_n = \sum_{k=1}^{n} h_{n-k} X_k$.

What is distribution of $Y_n$?

n-fold convolution a mess. Describe shortcut.

Transforms of probability functions called characteristic functions. Variation on Fourier/Laplace transforms. Notation varies.


For discrete rv with pmf $p_X$, define characteristic function $M_X$

$$M_X(ju) = \sum_x p_X(x) e^{jux}$$

where u is usually assumed to be real.

A discrete exponential transform. Sometimes φ, Φ, j not included. (∼ notational differences in Fourier transforms)

Alternative useful form: Recall definition of expectation of a random variable g defined on a discrete probability space described by a pmf p: $E(g) = \sum_\omega p(\omega) g(\omega)$

Consider probability space $(\Omega_X, \mathcal{B}(\Omega_X), P_X)$ with $P_X$ described by pmf $p_X$

This is directly-given representation for rv X, X is the identity function on $\Omega_X$: X(x) = x


Define random variable g(X) on this space by $g(X)(x) = e^{jux}$. Then $E[g(X)] = \sum_x p_X(x) e^{jux}$ so that

$$M_X(ju) = E[e^{juX}]$$

Characteristic functions, like probabilities, can be viewed as special cases of expectation

Resembles discrete time Fourier transform

$$\mathcal{F}_\nu(p_X) = \sum_x p_X(x) e^{-j2\pi\nu x}$$

and the z-transform

$$\mathcal{Z}_z(p_X) = \sum_x p_X(x) z^x.$$

$$M_X(ju) = \mathcal{F}_{-u/2\pi}(p_X) = \mathcal{Z}_{e^{ju}}(p_X)$$

Properties of characteristic functions follow from those of Fourier/Laplace/z/exponential transforms.


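Can recover pmf from $M_X$ by suitable inversion. E.g., given $p_X(k)$; $k \in \mathbb{Z}_N$,

$$\begin{aligned}
\frac{1}{2\pi}\int_{-\pi}^{\pi} M_X(ju) e^{-juk}\, du &= \frac{1}{2\pi}\int_{-\pi}^{\pi} \left(\sum_x p_X(x) e^{jux}\right) e^{-juk}\, du \\
&= \sum_x p_X(x)\, \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{ju(x-k)}\, du \\
&= \sum_x p_X(x)\, \delta_{k-x} = p_X(k).
\end{aligned}$$

(The integral runs over a full period of length 2π, which is what makes the inner integral equal the Kronecker delta for integer x − k.)

But usually invert by inspection or from tables, avoid inverse transforms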
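
The inversion of a characteristic function of an integer-valued pmf can be sketched numerically. The example pmf, the midpoint rule, and the function names below are assumptions made for illustration; the integral is taken over one full period [−π, π].

```python
import cmath
import math

def char_fn(pmf, u):
    """M_X(ju) = sum_x p_X(x) e^{jux} for a pmf given as {integer: probability}."""
    return sum(p * cmath.exp(1j * u * x) for x, p in pmf.items())

def invert(pmf, k, steps=2000):
    """(1/2pi) * integral over [-pi, pi] of M_X(ju) e^{-juk} du, midpoint rule."""
    h = 2.0 * math.pi / steps
    total = 0.0 + 0.0j
    for i in range(steps):
        u = -math.pi + (i + 0.5) * h
        total += char_fn(pmf, u) * cmath.exp(-1j * u * k)
    return (total * h / (2.0 * math.pi)).real

pmf = {0: 0.2, 1: 0.5, 3: 0.3}   # illustrative pmf on the integers
recovered = {k: invert(pmf, k) for k in range(5)}
print(recovered)
```

In practice the pmf is usually read off by inspection, as the notes say; the numerical inversion is only a check that the transform pair is consistent.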


Characteristic functions and summing independent rvs

Two independent random variables X, W with pmfs $p_X$ and $p_W$ and characteristic functions $M_X$ and $M_W$

Y = X + W

To find characteristic function of Y

$$M_Y(ju) = \sum_y p_Y(y) e^{juy}$$

use the inverse image formula

$$p_Y(y) = \sum_{x,w:\,x+w=y} p_{X,W}(x,w)$$

to obtain

$$\begin{aligned}
M_Y(ju) &= \sum_y \left(\sum_{x,w:\,x+w=y} p_{X,W}(x,w)\right) e^{juy} = \sum_y \sum_{x,w:\,x+w=y} p_{X,W}(x,w)\, e^{juy} \\
&= \sum_y \sum_{x,w:\,x+w=y} p_{X,W}(x,w)\, e^{ju(x+w)} \\
&= \sum_{x,w} p_{X,W}(x,w)\, e^{ju(x+w)}
\end{aligned}$$

Last sum factors:

$$\begin{aligned}
M_Y(ju) &= \sum_{x,w} p_X(x) p_W(w) e^{jux} e^{juw} = \sum_x p_X(x) e^{jux} \sum_w p_W(w) e^{juw} \\
&= M_X(ju) M_W(ju),
\end{aligned}$$

⇒ transform of the pmf of the sum of independent random variables is the product of their transforms


Iterate:

Theorem 1. If $X_i$; $i = 1, \ldots, N$ are independent random variables with characteristic functions $M_{X_i}$, then the characteristic function of the random variable $Y = \sum_{i=1}^{N} X_i$ is

$$M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju).$$

If the $X_i$ are independent and identically distributed with common characteristic function $M_X$, then

$$M_Y(ju) = M_X^N(ju).$$


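Example: X Bernoulli with parameter $p = p_X(1) = 1 - p_X(0)$

$$M_X(ju) = \sum_{k=0}^{1} e^{juk} p_X(k) = (1-p) + pe^{ju}$$

$X_i$; $i = 1, \ldots, n$ iid Bernoulli random variables, $Y_n = \sum_{i=1}^{n} X_i$, then

$$M_{Y_n}(ju) = [(1-p) + pe^{ju}]^n$$

with binomial theorem ⇒

$$\begin{aligned}
M_{Y_n}(ju) = \sum_{k=0}^{n} p_{Y_n}(k) e^{juk} &= ((1-p) + pe^{ju})^n \\
&= \sum_{k=0}^{n} \underbrace{\binom{n}{k}(1-p)^{n-k} p^k}_{p_{Y_n}(k)}\, e^{juk},
\end{aligned}$$

Uniqueness of transforms ⇒

$$p_{Y_n}(k) = \binom{n}{k}(1-p)^{n-k} p^k; \quad k \in \mathbb{Z}_{n+1}.$$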
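
The identity between the n-th power of the Bernoulli characteristic function and the transform of the binomial pmf can be spot-checked numerically. The values of n, p, and u below are arbitrary illustrative choices, and the function names are hypothetical.

```python
import cmath
import math

def bernoulli_cf(u, p):
    """M_X(ju) = (1 - p) + p e^{ju}."""
    return (1.0 - p) + p * cmath.exp(1j * u)

def binomial_pmf(n, k, p):
    """Binomial pmf: C(n, k) (1 - p)^(n - k) p^k."""
    return math.comb(n, k) * (1.0 - p) ** (n - k) * p ** k

def binomial_cf(u, n, p):
    """sum_k p_{Y_n}(k) e^{juk} using the binomial pmf."""
    return sum(binomial_pmf(n, k, p) * cmath.exp(1j * u * k) for k in range(n + 1))

n, p, u = 7, 0.3, 0.9   # illustrative values
lhs = bernoulli_cf(u, p) ** n
rhs = binomial_cf(u, n, p)
print(lhs, rhs)
```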

Same idea works for continuous rvs

For a continuous random variable X with pdf $f_X$, define the characteristic function $M_X$ of the random variable (or of the pdf) as

$$M_X(ju) = \int f_X(x) e^{jux}\, dx.$$

As in the discrete case,

$$M_X(ju) = E[e^{juX}].$$


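Relates to the continuous-time Fourier transform

$$\mathcal{F}_\nu(f_X) = \int f_X(x) e^{-j2\pi\nu x}\, dx$$

and the Laplace transform

$$\mathcal{L}_s(f_X) = \int f_X(x) e^{-sx}\, dx$$

by

$$M_X(ju) = \mathcal{F}_{-u/2\pi}(f_X) = \mathcal{L}_{-ju}(f_X)$$

Hence can apply results from Fourier/Laplace transform theory. E.g., given a well-behaved density $f_X(x)$; $x \in \mathbb{R}$ with characteristic function $M_X(ju)$, can invert the transform

$$f_X(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} M_X(ju) e^{-jux}\, du.$$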


Consider again two independent random variables X and W with pdfs $f_X$ and $f_W$, characteristic functions $M_X$ and $M_W$, and Y = X + W

Paralleling the discrete case,

$$M_Y(ju) = M_X(ju) M_W(ju).$$

Will later see simple and general proof.


As in the discrete case, iterating gives result for many independent rvs:

If $X_i$; $i = 1, \ldots, N$ are independent random variables with characteristic functions $M_{X_i}$, then the characteristic function of the random variable $Y = \sum_{i=1}^{N} X_i$ is

$$M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju).$$

If the $X_i$ are independent and identically distributed with common characteristic function $M_X$, then

$$M_Y(ju) = M_X^N(ju).$$


Summing Independent Gaussian rvs

X ∼ N(m,σ2)

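Characteristic function found by completing the square:

$$\begin{aligned}
M_X(ju) = E(e^{juX}) &= \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x-m)^2/2\sigma^2} e^{jux}\, dx \\
&= \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x^2 - 2mx - 2\sigma^2 jux + m^2)/2\sigma^2}\, dx \\
&= \left[\int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x - (m + ju\sigma^2))^2/2\sigma^2}\, dx\right] e^{jum - u^2\sigma^2/2} \\
&= e^{jum - u^2\sigma^2/2}.
\end{aligned}$$

Thus $N(m,\sigma^2) \leftrightarrow e^{jum - u^2\sigma^2/2}$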


$X_i$; $i = 1, \ldots, n$ iid Gaussian random variables with pdfs $N(m,\sigma^2)$

$Y_n = \sum_{i=1}^{n} X_i$

Then

$$M_{Y_n}(ju) = [e^{jum - u^2\sigma^2/2}]^n = e^{ju(nm) - u^2(n\sigma^2)/2},$$

= characteristic function of $N(nm, n\sigma^2)$

Moral: Use characteristic functions to derive distributions of sums of independent rvs.


Gaussian random vectors

A random vector is Gaussian if its density is Gaussian

Component rvs are jointly Gaussian

Description is complicated, but many nice properties

Multidimensional characteristic functions help derivation

Random vector X = (X0, . . . , Xn−1)

vector argument u = (u0, . . . , un−1)


n-dimensional characteristic function:

$$M_X(ju) = M_{X_0,\ldots,X_{n-1}}(ju_0,\ldots,ju_{n-1}) = E\left[e^{ju^tX}\right] = E\left[\exp\left(j\sum_{k=0}^{n-1} u_k X_k\right)\right]$$

Can be shown using multivariable calculus: Gaussian random vector with mean vector m and covariance matrix Λ has characteristic function

$$M_X(ju) = e^{ju^tm - u^t\Lambda u/2} = \exp\left(j\sum_{k=0}^{n-1} u_k m_k - \frac12\sum_{k=0}^{n-1}\sum_{m=0}^{n-1} u_k \Lambda(k,m) u_m\right)$$

Same basic form as Gaussian pdf, but depends directly on Λ, not $\Lambda^{-1}$


So exists more generally, only need Λ to be nonnegative definite (instead of strictly positive definite). Define Gaussian rv more generally as a rv having a characteristic function of this form (inverse transform will have singularities)


Further examples of random processes:

Have seen two ways to define rps: Indirectly in terms of an underlying probability space or directly (Kolmogorov representation) by describing consistent family of joint distributions (via pmfs, pdfs, or cdfs).

Used to define discrete time iid processes and processes which can be constructed from iid processes by coding or filtering.

Introduce more classes of processes and develop some properties for various examples.

In particular: Gaussian random processes and Markov processes


Gaussian random processes

A random process $X_t$; $t \in \mathcal{T}$ is Gaussian if the random vectors $(X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}})$ are Gaussian for all positive integers k and all possible sample times $t_i \in \mathcal{T}$; $i = 0, 1, \ldots, k-1$.

Works for continuous and discrete time.

Consistent family?

Yes if all mean vectors and covariance matrices drawn from a common mean function $m(t)$; $t \in \mathcal{T}$ and covariance function $\Lambda(t,s)$; $t, s \in \mathcal{T}$; i.e., for any choice of sample times $t_0, \ldots, t_{k-1} \in \mathcal{T}$ the random vector $(X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}})$ is Gaussian with mean $(m(t_0), m(t_1), \ldots, m(t_{k-1}))$ and the covariance matrix is $\Lambda = \{\Lambda(t_l, t_j);\ l, j \in \mathbb{Z}_k\}$.

Gaussian random processes in both discrete and continuous time are extremely common in analysis of random systems and have many nice properties.


Discrete time Markov processes

An iid process is memoryless because present independent of past.

A Markov process allows dependence on the past in a structured way.

Introduce via example.


A binary Markov process

$X_n$; $n = 0, 1, \ldots$ is a Bernoulli process with

$$p_{X_n}(x) = \begin{cases} p & x = 1 \\ 1-p & x = 0 \end{cases},$$

$p \in (0,1)$ a fixed parameter

Since the pmf $p_{X_n}(x)$ does not depend on n, abbreviate to $p_X$:

$$p_X(x) = p^x(1-p)^{1-x}; \quad x = 0, 1.$$


Since process iid

$$p_{X^n}(x^n) = \prod_{i=0}^{n-1} p_X(x_i) = p^{w(x^n)}(1-p)^{n-w(x^n)},$$

where $w(x^n)$ = Hamming weight of the binary vector $x^n$.

Let $X_n$ be input to a device which produces an output binary process $Y_n$ defined by

$$Y_n = \begin{cases} Y_0 & n = 0 \\ X_n \oplus Y_{n-1} & n = 1, 2, \ldots \end{cases},$$

where $Y_0$ is a binary equiprobable random variable ($p_{Y_0}(0) = p_{Y_0}(1) = 0.5$), independent of all of the $X_n$ and ⊕ is mod 2 addition

(linear filter using mod 2 arithmetic)


Alternatively:

$$Y_n = \begin{cases} 1 & \text{if } X_n \ne Y_{n-1} \\ 0 & \text{if } X_n = Y_{n-1}. \end{cases}$$

This process is called a binary autoregressive process. As will be seen, it is also called the symmetric binary Markov process

Unlike $X_n$, $Y_n$ depends strongly on past values. If p < 1/2, $Y_n$ is more likely to equal $Y_{n-1}$ than not

If p is small, $Y_n$ is likely to have long runs of 0s and 1s.

Task: Find joint pmfs for new process: $p_{Y^n}(y^n) = \Pr(Y^n = y^n)$


Use inverse image formula:

$$\begin{aligned}
p_{Y^n}(y^n) &= \Pr(Y^n = y^n) \\
&= \Pr(Y_0 = y_0, Y_1 = y_1, Y_2 = y_2, \ldots, Y_{n-1} = y_{n-1}) \\
&= \Pr(Y_0 = y_0, X_1 \oplus Y_0 = y_1, X_2 \oplus Y_1 = y_2, \ldots, X_{n-1} \oplus Y_{n-2} = y_{n-1}) \\
&= \Pr(Y_0 = y_0, X_1 \oplus y_0 = y_1, X_2 \oplus y_1 = y_2, \ldots, X_{n-1} \oplus y_{n-2} = y_{n-1}) \\
&= \Pr(Y_0 = y_0, X_1 = y_1 \oplus y_0, X_2 = y_2 \oplus y_1, \ldots, X_{n-1} = y_{n-1} \oplus y_{n-2}) \\
&= p_{Y_0,X_1,X_2,\ldots,X_{n-1}}(y_0, y_1 \oplus y_0, y_2 \oplus y_1, \ldots, y_{n-1} \oplus y_{n-2}) \\
&= p_{Y_0}(y_0) \prod_{i=1}^{n-1} p_X(y_i \oplus y_{i-1}).
\end{aligned}$$

Used the facts that (1) a ⊕ b = c iff a = b ⊕ c, (2) $Y_0, X_1, X_2, \ldots, X_{n-1}$ mutually independent, and (3) $X_n$ are iid.


Plug in specific forms of $p_{Y_0}$ and $p_X$ ⇒

$$p_{Y^n}(y^n) = \frac12 \prod_{i=1}^{n-1} p^{y_i \oplus y_{i-1}} (1-p)^{1 - y_i \oplus y_{i-1}}.$$

Marginal pmfs for $Y_n$ evaluated by summing out the joints (total probability), e.g.,

$$p_{Y_1}(y_1) = \sum_{y_0} p_{Y_0,Y_1}(y_0, y_1) = \frac12 \sum_{y_0} p^{y_1 \oplus y_0}(1-p)^{1 - y_1 \oplus y_0} = \frac12; \quad y_1 = 0, 1.$$

In a similar fashion it can be shown that the marginals for $Y_n$ are all the same:

$$p_{Y_n}(y) = \frac12; \quad y = 0, 1;\ n = 0, 1, 2, \ldots$$


Hence drop subscript and abbreviate pmf to $p_Y$

Note: Would not be the same with different initialization, e.g., $Y_0 = 1$

Unlike the iid $X_n$ process

$$p_{Y^n}(y^n) \ne \prod_{i=0}^{n-1} p_Y(y_i)$$

(provided p ≠ 1/2)

$Y_n$ not iid

Joint not product of marginals, but can use chain rule with conditional probabilities to write as product of conditional pmfs, given by

$$p_{Y_l|Y_0,Y_1,\ldots,Y_{l-1}}(y_l|y_0, y_1, \ldots, y_{l-1}) = \frac{p_{Y^{l+1}}(y^{l+1})}{p_{Y^l}(y^l)} = p_X(y_l \oplus y_{l-1})$$


Note: Conditional probability of current output $Y_l$ given entire past $Y_i$; $i = 0, 1, \ldots, l-1$ depends only on the most recent past output $Y_{l-1}$! This property can be summarized nicely by also deriving the conditional pmf

$$p_{Y_l|Y_{l-1}}(y_l|y_{l-1}) = \frac{p_{Y_{l-1},Y_l}(y_{l-1}, y_l)}{p_{Y_{l-1}}(y_{l-1})} = p^{y_l \oplus y_{l-1}}(1-p)^{1 - y_l \oplus y_{l-1}}$$

$$\Rightarrow p_{Y_l|Y_0,Y_1,\ldots,Y_{l-1}}(y_l|y_0, y_1, \ldots, y_{l-1}) = p_{Y_l|Y_{l-1}}(y_l|y_{l-1}).$$

A discrete time random process with this property is called a Markov process or Markov chain

The binary autoregressive process is a Markov process!
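
The joint pmf of the binary autoregressive process and its Markov property can be checked by exhaustive enumeration for a short block. The block length, the value of p, and the helper names in this sketch are illustrative assumptions.

```python
from itertools import product

def joint_pmf(y, p):
    """p_{Y^n}(y^n) = (1/2) * prod_{i>=1} p_X(y_i xor y_{i-1}) for a binary tuple y."""
    prob = 0.5
    for prev, cur in zip(y, y[1:]):
        prob *= p if (cur ^ prev) == 1 else 1.0 - p
    return prob

def conditional_last(y, p):
    """Pr(Y_l = y_l | Y_0, ..., Y_{l-1}) computed as a ratio of joint pmfs."""
    return joint_pmf(y, p) / joint_pmf(y[:-1], p)

p, n = 0.2, 4   # illustrative values
blocks = list(product((0, 1), repeat=n))
total = sum(joint_pmf(y, p) for y in blocks)
marginal_last_one = sum(joint_pmf(y, p) for y in blocks if y[-1] == 1)

# Same (y_{l-1}, y_l) = (1, 1) but different earlier pasts: conditionals coincide.
c1 = conditional_last((0, 0, 1, 1), p)
c2 = conditional_last((1, 0, 1, 1), p)
print(total, marginal_last_one, c1, c2)
```

The two conditional probabilities agree because only the most recent past sample enters the ratio, which is the Markov property stated above.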


The binomial counting process

Next filter binary Bernoulli process using ordinary arithmetic.

$X_n$ iid binary random process with marginal pmf $p_X(1) = p = 1 - p_X(0)$.

$$Y_n = \begin{cases} Y_0 = 0 & n = 0 \\ \sum_{k=1}^{n} X_k = Y_{n-1} + X_n & n = 1, 2, \ldots \end{cases}$$

$Y_n$ = output of a discrete time time-invariant linear filter with Kronecker delta response $h_k$ given by $h_k = 1$ for k ≥ 0 and $h_k = 0$ otherwise.

By definition,

$$Y_n = Y_{n-1} \text{ or } Y_n = Y_{n-1} + 1; \quad n = 2, 3, \ldots$$


A discrete time process with this property is called a counting process. Will later see a continuous time counting process which also can only increase by 1

To completely describe this process need a formula for the joint pmfs

$$p_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n) = p_{Y_1}(y_1) \prod_{l=2}^{n} p_{Y_l|Y_1,\ldots,Y_{l-1}}(y_l|y_1,\ldots,y_{l-1})$$

Already found marginal pmf $p_{Y_n}(k)$ using transforms to be binomial ⇒ binomial counting process

Find conditional pmfs, which imply joints via chain rule.


$$\begin{aligned}
p_{Y_n|Y_{n-1},\ldots,Y_1}(y_n|y_{n-1},\ldots,y_1)
&= \Pr(Y_n = y_n | Y_l = y_l;\ l = 1, \ldots, n-1) \\
&= \Pr(X_n = y_n - y_{n-1} | Y_l = y_l;\ l = 1, \ldots, n-1) \\
&= \Pr(X_n = y_n - y_{n-1} | X_1 = y_1, X_i = y_i - y_{i-1};\ i = 2, 3, \ldots, n-1)
\end{aligned}$$

Follows since conditioning event $\{Y_i = y_i;\ i = 1, 2, \ldots, n-1\}$ is the event $\{X_1 = y_1, X_i = y_i - y_{i-1};\ i = 2, 3, \ldots, n-1\}$ and, given this event, the event $Y_n = y_n$ is the event $X_n = y_n - y_{n-1}$.

Thus

$$p_{Y_n|Y_{n-1},\ldots,Y_1}(y_n|y_{n-1},\ldots,y_1) = p_{X_n|X_{n-1},\ldots,X_2,X_1}(y_n - y_{n-1}|y_{n-1} - y_{n-2}, \ldots, y_2 - y_1, y_1)$$


$X_n$ iid ⇒

$$p_{Y_n|Y_{n-1},\ldots,Y_1}(y_n|y_{n-1},\ldots,y_1) = p_X(y_n - y_{n-1})$$

Hence chain rule + definition $y_0 = 0$ ⇒

$$p_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n) = \prod_{i=1}^{n} p_X(y_i - y_{i-1})$$

For binomial counting process, use Bernoulli $p_X$:

$$p_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n) = \prod_{i=1}^{n} p^{(y_i - y_{i-1})}(1-p)^{1-(y_i - y_{i-1})},$$

where $y_i - y_{i-1} = 0$ or 1, $i = 1, 2, \ldots, n$; $y_0 = 0$.


Similar derivation ⇒

$$p_{Y_n|Y_{n-1}}(y_n|y_{n-1}) = \Pr(Y_n = y_n | Y_{n-1} = y_{n-1}) = \Pr(X_n = y_n - y_{n-1} | Y_{n-1} = y_{n-1}).$$

Conditioning event depends only on values of $X_k$ for k < n, hence $p_{Y_n|Y_{n-1}}(y_n|y_{n-1}) = p_X(y_n - y_{n-1})$ ⇒ $Y_n$ is Markov

Similar derivation works for sum of iid rvs with any pmf $p_X$ to show that

$$p_{Y_n|Y_{n-1},\ldots,Y_1}(y_n|y_{n-1},\ldots,y_1) = p_{Y_n|Y_{n-1}}(y_n|y_{n-1})$$

or, equivalently,

$$\Pr(Y_n = y_n | Y_i = y_i;\ i = 1, \ldots, n-1) = \Pr(Y_n = y_n | Y_{n-1} = y_{n-1}),$$

⇒ Markov
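
The product form of the joint pmf of the binomial counting process can be checked against the binomial marginal by summing out the earlier samples. The sketch below enumerates all paths for a small n; the names and the values of n and p are illustrative assumptions.

```python
import math
from itertools import product

def joint_pmf(ys, p):
    """prod_i p_X(y_i - y_{i-1}) with y_0 = 0; zero unless every increment is 0 or 1."""
    prob, prev = 1.0, 0
    for y in ys:
        inc = y - prev
        if inc not in (0, 1):
            return 0.0
        prob *= p if inc == 1 else 1.0 - p
        prev = y
    return prob

def marginal(n, k, p):
    """Pr(Y_n = k), obtained by summing the joint pmf over y_1, ..., y_{n-1}."""
    return sum(joint_pmf(ys + (k,), p) for ys in product(range(n + 1), repeat=n - 1))

n, p = 4, 0.3   # illustrative values
for k in range(n + 1):
    print(k, marginal(n, k, p), math.comb(n, k) * (1 - p) ** (n - k) * p ** k)
```

Brute-force enumeration grows exponentially in n and is meant only as a check for small cases.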


Discrete random walk

Slight variation: Let $X_n$ be binary iid with alphabet {1, −1} and $\Pr(X_n = -1) = p$

$$Y_n = \begin{cases} 0 & n = 0 \\ \sum_{k=1}^{n} X_k & n = 1, 2, \ldots \end{cases},$$

Also has autoregressive format

$$Y_n = Y_{n-1} + X_n, \quad n = 1, 2, \ldots$$

Transform of the iid random variables is

$$M_X(ju) = (1-p)e^{ju} + pe^{-ju},$$


binomial theorem ⇒

$$\begin{aligned}
M_{Y_n}(ju) &= ((1-p)e^{ju} + pe^{-ju})^n \\
&= \sum_{k=0}^{n} \binom{n}{k}(1-p)^{n-k} p^k\, e^{ju(n-2k)} \\
&= \sum_{k=-n,-n+2,\ldots,n-2,n} \underbrace{\binom{n}{(n-k)/2}(1-p)^{(n+k)/2} p^{(n-k)/2}}_{p_{Y_n}(k)}\, e^{juk}.
\end{aligned}$$

⇒

$$p_{Y_n}(k) = \binom{n}{(n-k)/2}(1-p)^{(n+k)/2} p^{(n-k)/2}, \quad k = -n, -n+2, \ldots, n-2, n.$$

Note that $Y_n$ must be even or odd depending on whether n is even or odd. This follows from the nature of the increments.
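
The closed-form pmf of the random walk can be compared with the pmf obtained by convolving the ±1 step distribution n times. This is an illustrative sketch; the function names and the values of n and p are assumptions.

```python
import math

def step_convolve(pmf, p):
    """One step of the walk: +1 with probability 1 - p, -1 with probability p."""
    out = {}
    for y, prob in pmf.items():
        out[y + 1] = out.get(y + 1, 0.0) + prob * (1.0 - p)
        out[y - 1] = out.get(y - 1, 0.0) + prob * p
    return out

def walk_pmf(n, p):
    """pmf of Y_n built by n successive convolutions, starting from Y_0 = 0."""
    pmf = {0: 1.0}
    for _ in range(n):
        pmf = step_convolve(pmf, p)
    return pmf

def walk_formula(n, k, p):
    """Closed form, valid for k in {-n, -n+2, ..., n}."""
    return math.comb(n, (n - k) // 2) * (1.0 - p) ** ((n + k) // 2) * p ** ((n - k) // 2)

n, p = 6, 0.3   # illustrative values
numeric = walk_pmf(n, p)
for k in range(-n, n + 1, 2):
    print(k, numeric[k], walk_formula(n, k, p))
```

Only values of k with the same parity as n appear as keys of the convolved pmf, in line with the parity remark above.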


The discrete time Wiener process

$X_n$ iid $N(0, \sigma^2)$.

As with the counting process, define

$$Y_n = \begin{cases} 0 & n = 0 \\ \sum_{k=1}^{n} X_k & n = 1, 2, \ldots \end{cases},$$

discrete time Wiener process

Handle in essentially the same way, but use cdfs and then pdfs

Previously found marginal $f_{Y_n}$ using transforms to be $N(0, n\sigma^2)$


To find the joint pdfs use conditional pdfs and chain rule

$$f_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n) = \prod_{l=1}^{n} f_{Y_l|Y_1,\ldots,Y_{l-1}}(y_l|y_1,\ldots,y_{l-1}).$$

To find conditional pdf $f_{Y_n|Y_1,\ldots,Y_{n-1}}(y_n|y_1,\ldots,y_{n-1})$, first find conditional cdf $P(Y_n \le y_n | Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots, n-1)$. Analogous to the discrete case:

$$\begin{aligned}
P(Y_n \le y_n | Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots, n-1)
&= P(X_n \le y_n - y_{n-1} | Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots, n-1) \\
&= P(X_n \le y_n - y_{n-1}) = F_X(y_n - y_{n-1}),
\end{aligned}$$


Differentiating the conditional cdf to obtain the conditional pdf ⇒

$$f_{Y_n|Y_1,\ldots,Y_{n-1}}(y_n|y_1,\ldots,y_{n-1}) = \frac{\partial}{\partial y_n} F_X(y_n - y_{n-1}) = f_X(y_n - y_{n-1}),$$

pdf chain rule ⇒ (with $y_0 = 0$)

$$f_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n) = \prod_{i=1}^{n} f_X(y_i - y_{i-1}).$$


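If $f_X = N(0,\sigma^2)$

$$\begin{aligned}
f_{Y^n}(y^n) &= \frac{\exp\left[-\frac{y_1^2}{2\sigma^2}\right]}{\sqrt{2\pi\sigma^2}} \prod_{i=2}^{n} \frac{\exp\left[-\frac{(y_i - y_{i-1})^2}{2\sigma^2}\right]}{\sqrt{2\pi\sigma^2}} \\
&= (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2}\left(\sum_{i=2}^{n}(y_i - y_{i-1})^2 + y_1^2\right)\right].
\end{aligned}$$

This is a joint Gaussian pdf with mean vector 0 and covariance matrix $K_Y(m,n) = \sigma^2\min(m,n)$, $m, n = 1, 2, \ldots$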
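
For n = 2 the claim about the covariance can be checked by hand: the covariance σ² min(m, n) gives the matrix σ²[[1, 1], [1, 2]], with determinant σ⁴ and inverse (1/σ²)[[2, −1], [−1, 1]]. The sketch below compares the resulting Gaussian pdf with the chain-rule product; the variance and the evaluation point are illustrative assumptions.

```python
import math

def gauss_pdf(x, var):
    """Zero-mean Gaussian pdf with variance var."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def product_form(y1, y2, var):
    """f_X(y1) f_X(y2 - y1): the chain-rule form with y_0 = 0."""
    return gauss_pdf(y1, var) * gauss_pdf(y2 - y1, var)

def gaussian_form(y1, y2, var):
    """Zero-mean 2D Gaussian pdf with covariance var * [[1, 1], [1, 2]]."""
    det = var * var   # determinant of var * [[1, 1], [1, 2]]
    # Quadratic form y^t K^{-1} y with K^{-1} = (1/var) * [[2, -1], [-1, 1]].
    quad = (2.0 * y1 * y1 - 2.0 * y1 * y2 + y2 * y2) / var
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

var, y1, y2 = 2.0, 0.4, -1.1   # illustrative values
print(product_form(y1, y2, var), gaussian_form(y1, y2, var))
```

The same comparison for larger n would need a general matrix inverse and determinant, which this sketch deliberately avoids.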

A similar argument implies that

fYn|Yn−1(yn|yn−1) = fX(yn − yn−1)


and hence

fYn|Y1,...,Yn−1(yn|y1, . . . , yn−1) = fYn|Yn−1(yn|yn−1).

As in discrete alphabet case, a process with this property is called a Markov process

Combine the discrete alphabet and continuous alphabet definitions into a common definition: a discrete time random process $Y_n$ is said to be a Markov process if the conditional cdfs satisfy the relation

$$\Pr(Y_n \le y_n | Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots) = \Pr(Y_n \le y_n | Y_{n-1} = y_{n-1})$$

for all $y_{n-1}, y_{n-2}, \ldots$


More specifically, such a $Y_n$ is frequently called a first-order Markov process because it depends on only the most recent past value. An extended definition to nth-order Markov processes can be made in the obvious fashion.

