
Introduction to Probability for Electrical Engineering

Prapun Suksompong
School of Electrical and Computer Engineering
Cornell University, Ithaca, NY 14853
[email protected]

August 24, 2009

Contents

1 Mathematical Background
   1.1 Set Theory
   1.2 Enumeration / Combinatorics / Counting
   1.3 Dirac Delta Function

2 Classical Probability
   2.1 Examples

3 Probability Foundations
   3.1 Algebra and σ-algebra
   3.2 Kolmogorov's Axioms for Probability
   3.3 Properties of Probability Measure
   3.4 Countable Ω
   3.5 Independence

4 Random Element
   4.1 Random Variable
   4.2 Distribution Function
   4.3 Discrete random variable
   4.4 Continuous random variable
   4.5 Mixed/hybrid Distribution
   4.6 Independence
   4.7 Misc

5 PMF Examples
   5.1 Random/Uniform
   5.2 Bernoulli and Binary distributions
   5.3 Binomial: B(n, p)
   5.4 Geometric: G(β)
   5.5 Poisson Distribution: P(λ)
   5.6 Compound Poisson
   5.7 Hypergeometric
   5.8 Negative Binomial Distribution (Pascal / Pólya distribution)
   5.9 Beta-binomial distribution
   5.10 Zipf or zeta random variable

6 PDF Examples
   6.1 Uniform Distribution
   6.2 Gaussian Distribution
   6.3 Exponential Distribution
   6.4 Pareto: Par(α), a heavy-tailed model/density
   6.5 Laplacian: L(α)
   6.6 Rayleigh
   6.7 Cauchy
   6.8 More PDFs

7 Expectation

8 Inequalities

9 Random Vectors
   9.1 Random Sequence

10 Transform Methods
   10.1 Probability Generating Function
   10.2 Moment Generating Function
   10.3 One-Sided Laplace Transform
   10.4 Characteristic Function

11 Functions of random variables
   11.1 SISO case
   11.2 MISO case
   11.3 MIMO case
   11.4 Order Statistics

12 Convergences
   12.1 Summation of random variables
   12.2 Summation of independent random variables
   12.3 Summation of i.i.d. random variables
   12.4 Central Limit Theorem (CLT)

13 Conditional Probability and Expectation
   13.1 Conditional Probability
   13.2 Conditional Expectation
   13.3 Conditional Independence

14 Real-valued Jointly Gaussian

15 Bayesian Detection and Estimation

A Math Review
   A.1 Inequalities
   A.2 Summations
   A.3 Calculus
      A.3.1 Derivatives
      A.3.2 Integration
   A.4 Gamma and Beta functions


1 Mathematical Background

1.1 Set Theory

1.1. Basic Set Identities:

• Double complementation (involution): (A^c)^c = A

• Commutativity (symmetry):

  A ∪ B = B ∪ A,   A ∩ B = B ∩ A

• Associativity:

  A ∩ (B ∩ C) = (A ∩ B) ∩ C,   A ∪ (B ∪ C) = (A ∪ B) ∪ C

• Distributivity:

  A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)

  A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

• De Morgan laws:

  (A ∪ B)^c = A^c ∩ B^c

  (A ∩ B)^c = A^c ∪ B^c

1.2. Basic Terminology:

• A ∩ B is sometimes written simply as AB.

• Sets A and B are said to be disjoint (A ⊥ B) if and only if A ∩ B = ∅.

• A collection of sets (A_i : i ∈ I) is said to be pairwise disjoint or mutually exclusive [9, p. 9] if and only if A_i ∩ A_j = ∅ whenever i ≠ j.

• A collection Π = (A_α : α ∈ I) of subsets of Ω (in this case, indexed or labeled by α taking values in an index or label set I) is said to be a partition of Ω if

  (a) Ω = ⋃_{α∈I} A_α, and

  (b) for all α_1 ≠ α_2, A_{α_1} ⊥ A_{α_2} (pairwise disjoint).

  In this case, the collection (B ∩ A_α : α ∈ I) is a partition of B. In other words, any set B can be expressed as B = ⋃_α (B ∩ A_α), where the union is a disjoint union.

• The cardinality (or size) of a collection or set A, denoted |A|, is the number of elements of the collection. This number may be finite or infinite.


[Figure 1 in the original reproduces a page of set-identity material from [18]: a table of set identities (commutative, associative, and distributive laws, De Morgan's laws, complement, double complement, idempotent, absorption, dominance, and identity laws), remarks on Venn diagrams and membership tables, bit-string and binary-search-tree representations of subsets, the identity |A ∪ B| = |A| + |B| − |A ∩ B|, and the fact that |A| < |P(A)| for all sets A.]

Figure 1: Set Identities [18]

Inclusion-Exclusion Principle:

|⋃_{i=1}^{n} A_i| = Σ_{∅ ≠ I ⊂ {1,...,n}} (−1)^{|I|+1} |⋂_{i∈I} A_i|.

|⋂_{i=1}^{n} A_i^c| = |Ω| + Σ_{∅ ≠ I ⊂ {1,...,n}} (−1)^{|I|} |⋂_{i∈I} A_i|.

If ∀i, A_i ⊂ B (or, equivalently, ⋃_{i=1}^{n} A_i ⊂ B), then

|⋂_{i=1}^{n} (B \ A_i)| = |B| + Σ_{∅ ≠ I ⊂ {1,...,n}} (−1)^{|I|} |⋂_{i∈I} A_i|.
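The first identity can be checked mechanically on small examples. The following minimal Python sketch sums over all nonempty index subsets I, exactly as the formula prescribes; the three example sets are arbitrary choices.

    from itertools import combinations
    from functools import reduce

    # three arbitrary example sets: multiples of 2, 3, and 5 below 30
    sets = [set(range(0, 30, 2)), set(range(0, 30, 3)), set(range(0, 30, 5))]

    union_size = len(set().union(*sets))
    ie_sum = 0
    for r in range(1, len(sets) + 1):
        for idx in combinations(range(len(sets)), r):
            inter = reduce(set.intersection, (sets[i] for i in idx))
            ie_sum += (-1) ** (r + 1) * len(inter)   # (-1)^{|I|+1} |intersection over I|

    assert union_size == ie_sum
    print(union_size)   # 22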

• An infinite set A is said to be countable if the elements of A can be enumerated or listed in a sequence: a_1, a_2, . . . . The empty set and finite sets are also said to be countable. By a countably infinite set, we mean a countable set that is not finite.

• A singleton is a set with exactly one element.

• N = {1, 2, 3, . . .}; R = (−∞, ∞).

• For a set of sets, to avoid the repeated use of the word “set”, we will call it acollection/class/family of sets.

Definition 1.3. Monotone sequence of sets

• The sequence of events (A_1, A_2, A_3, . . .) is a monotone-increasing sequence of events if and only if A_1 ⊂ A_2 ⊂ A_3 ⊂ · · · . In this case,

  ⋃_{i=1}^{n} A_i = A_n   and   lim_{n→∞} A_n = ⋃_{i=1}^{∞} A_i.

  Put A = lim_{n→∞} A_n. We then write A_n ↗ A; that is, A_n ↗ A if and only if ∀n, A_n ⊂ A_{n+1} and ⋃_{n=1}^{∞} A_n = A.

• The sequence of events (B_1, B_2, B_3, . . .) is a monotone-decreasing sequence of events if and only if B_1 ⊃ B_2 ⊃ B_3 ⊃ · · · . In this case,

  ⋂_{i=1}^{n} B_i = B_n   and   lim_{n→∞} B_n = ⋂_{i=1}^{∞} B_i.

  Put B = lim_{n→∞} B_n. We then write B_n ↘ B; that is, B_n ↘ B if and only if ∀n, B_{n+1} ⊂ B_n and ⋂_{n=1}^{∞} B_n = B.

Note that A_n ↗ A ⇔ A_n^c ↘ A^c.

1.4. An (event-)indicator function I_A : Ω → {0, 1} is defined by

I_A(ω) = 1 if ω ∈ A, and 0 otherwise.

• Alternative notation: 1_A.

• A = {ω : I_A(ω) = 1}

• A = B if and only if I_A = I_B

• I_{A^c}(ω) = 1 − I_A(ω)

• A ⊂ B ⇔ ∀ω, I_A(ω) ≤ I_B(ω) ⇔ ∀ω, I_A(ω) = 1 ⇒ I_B(ω) = 1

• I_{A∩B}(ω) = min(I_A(ω), I_B(ω)) = I_A(ω) · I_B(ω)

• I_{A∪B}(ω) = max(I_A(ω), I_B(ω)) = I_A(ω) + I_B(ω) − I_A(ω) · I_B(ω)
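The min/max/product identities above are easy to confirm by brute force on a toy finite sample space; the short Python sketch below (with an arbitrary choice of Ω, A, and B) does exactly that.

    Omega = set(range(10))
    A = {1, 2, 3, 4}
    B = {3, 4, 5, 6}

    def I(S):
        # indicator function of the set S
        return lambda w: 1 if w in S else 0

    for w in Omega:
        assert I(A & B)(w) == min(I(A)(w), I(B)(w)) == I(A)(w) * I(B)(w)
        assert I(A | B)(w) == max(I(A)(w), I(B)(w)) \
               == I(A)(w) + I(B)(w) - I(A)(w) * I(B)(w)
        assert I(Omega - A)(w) == 1 - I(A)(w)
    print("indicator identities verified on the toy space")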


1.5. Suppose ⋃_{i=1}^{N} A_i = ⋃_{i=1}^{N} B_i for all finite N ∈ N. Then ⋃_{i=1}^{∞} A_i = ⋃_{i=1}^{∞} B_i.

Proof. To show "⊂", suppose x ∈ ⋃_{i=1}^{∞} A_i. Then ∃N_0 such that x ∈ A_{N_0}. Now, A_{N_0} ⊂ ⋃_{i=1}^{N_0} A_i = ⋃_{i=1}^{N_0} B_i ⊂ ⋃_{i=1}^{∞} B_i. Therefore, x ∈ ⋃_{i=1}^{∞} B_i. To show "⊃", use symmetry.

1.6. Let A_1, A_2, . . . be a sequence of disjoint sets. Define B_n = ⋃_{i>n} A_i = ⋃_{i≥n+1} A_i. Then (1) B_{n+1} ⊂ B_n and (2) ⋂_{n=1}^{∞} B_n = ∅.

Proof. B_n = ⋃_{i≥n+1} A_i = (⋃_{i≥n+2} A_i) ∪ A_{n+1} = B_{n+1} ∪ A_{n+1}, so (1) is true. For (2), consider two cases. (2.1) For an element x ∉ ⋃_i A_i, we know that x ∉ B_n for every n, and hence x ∉ ⋂_{n=1}^{∞} B_n. (2.2) For x ∈ ⋃_i A_i, we know that ∃i_0 such that x ∈ A_{i_0}. Note that x cannot be in any other A_i because the A_i's are disjoint. So x ∉ B_{i_0}, and therefore x ∉ ⋂_{n=1}^{∞} B_n.

1.7. Any countable union can be written as a union of pairwise disjoint sets: Given any sequence of sets F_n, define a new sequence by A_1 = F_1 and, for n ≥ 2,

A_n = F_n ∩ F_{n−1}^c ∩ · · · ∩ F_1^c = F_n ∩ ⋂_{i∈[n−1]} F_i^c = F_n \ (⋃_{i∈[n−1]} F_i).

Then ⋃_{i=1}^{∞} F_i = ⋃_{i=1}^{∞} A_i, where the union on the RHS is a disjoint union.

Proof. Note that the A_n are pairwise disjoint. To see this, consider A_{n_1}, A_{n_2} with n_1 ≠ n_2. WLOG, assume n_1 < n_2. Then F_{n_1}^c appears in the intersection defining A_{n_2}. So

A_{n_1} ∩ A_{n_2} = (F_{n_1} ∩ F_{n_1}^c) ∩ ⋂_{i=1}^{n_1−1} F_i^c ∩ F_{n_2} ∩ ⋂_{i=1, i≠n_1}^{n_2−1} F_i^c = ∅.

Also, for finite N ≥ 1, we have ⋃_{n∈[N]} F_n = ⋃_{n∈[N]} A_n (*). To see this, note that (*) is true for N = 1. Suppose (*) is true for N = m and let B = ⋃_{n∈[m]} A_n = ⋃_{n∈[m]} F_n. Now, for N = m + 1, by definition, we have

⋃_{n∈[m+1]} A_n = (⋃_{n∈[m]} A_n) ∪ A_{m+1} = B ∪ (F_{m+1} \ B) = B ∪ F_{m+1} = ⋃_{i∈[m+1]} F_i.

So (*) is true for N = m + 1. By induction, (*) is true for all finite N. The extension to ∞ is done via (1.5).

For a finite union, we can modify the above statement by setting F_n = ∅ for n ≥ N. Then A_n = ∅ for n ≥ N.

By construction, {A_n : n ∈ N} ⊂ σ({F_n : n ∈ N}). However, in general, it is not true that {F_n : n ∈ N} ⊂ σ({A_n : n ∈ N}). For example, for a finite union with N = 2, we can't get F_2 back from set operations on A_1, A_2 because we lost the information about F_1 ∩ F_2. To create a disjoint union which preserves the information about the overlapping parts of the F_n's, we can define the A's by ⋂_{n∈N} B_n, where each B_n is F_n or F_n^c. This is done in (1.8). However, this leads to uncountably many A_α, which is why we use the index α there instead of n. The uncountability problem does not occur if we start with a finite union. This is shown in the next result.

1.8. Decomposition:

• Fix sets A_1, A_2, . . . , A_n, not necessarily disjoint. Let Π be the collection of all sets of the form

  B = B_1 ∩ B_2 ∩ · · · ∩ B_n,

  where each B_j is either A_j or its complement. There are 2^n of these, say B^(1), B^(2), . . . , B^(2^n). Then,

  (a) Π is a partition of Ω, and

  (b) Π \ {⋂_{j∈[n]} A_j^c} is a partition of ⋃_{j∈[n]} A_j.

  Moreover, any A_j can be expressed as A_j = ⋃_{i∈S_j} B^(i) for some S_j ⊂ [2^n]. More specifically, A_j is the union of all B^(i) whose construction uses B_j = A_j.

• Fix sets A_1, A_2, . . ., not necessarily disjoint. Let Π be the collection of all sets of the form B = ⋂_{n∈N} B_n, where each B_n is either A_n or its complement. There are uncountably many of these, hence we index the B's by α; that is, we write B^(α). Let I be the set of all α after we eliminate all repeated B^(α); note that I can still be uncountable. Then,

  (a) Π = {B^(α) : α ∈ I} is a partition of Ω, and

  (b) Π \ {⋂_{n∈N} A_n^c} is a partition of ⋃_{n∈N} A_n.

  Moreover, any A_j can be expressed as A_j = ⋃_{α∈S_j} B^(α) for some S_j ⊂ I. Because I is uncountable, in general S_j can be uncountable. More specifically, A_j is the (possibly uncountable) union of all B^(α) whose construction uses B_j = A_j. The uncountability of S_j can be problematic because it implies that we need an uncountable union to get A_j back.

1.9. Let {A_α : α ∈ I} be a collection of disjoint sets, where I is a nonempty index set. For any set S ⊂ I, define a mapping g on 2^I by g(S) = ⋃_{α∈S} A_α. Then g is a 1:1 function if and only if none of the A_α's is empty.

1.10. Let

A = {(x, y) ∈ R² : (x + a_1, x + b_1) ∩ (y + a_2, y + b_2) ≠ ∅},

where a_i < b_i. Then,

A = {(x, y) ∈ R² : x + (a_1 − b_2) < y < x + (b_1 − a_2)} = {(x, y) ∈ R² : a_1 − b_2 < y − x < b_1 − a_2}.


[Figure 2 in the original plots this region in the (x, y)-plane: a diagonal band around the line y = x, bounded by the lines y = x + (a_1 − b_2) and y = x + (b_1 − a_2).]

Figure 2: The region A = {(x, y) ∈ R² : (x + a_1, x + b_1) ∩ (y + a_2, y + b_2) ≠ ∅}.

1.2 Enumeration / Combinatorics / Counting

1.11. The four kinds of counting problems are:

(a) ordered sampling of r out of n items with replacement: n^r;

(b) ordered sampling of r ≤ n out of n items without replacement: (n)_r;

(c) unordered sampling of r ≤ n out of n items without replacement: C(n, r);

(d) unordered sampling of r out of n items with replacement: C(n + r − 1, r).

(Here C(n, r) denotes the binomial coefficient "n choose r"; see 1.14.)

1.12. Given a set of n distinct items, select a distinct ordered sequence (word) of length r drawn from this set.

• Sampling with replacement: µ_{n,r} = n^r

  ◦ Ordered sampling of r out of n items with replacement.

  ◦ µ_{n,1} = n

  ◦ µ_{1,r} = 1

  ◦ µ_{n,r} = n · µ_{n,r−1} for r > 1

  ◦ Examples:

    ∗ Suppose A is a finite set; then the cardinality of its power set is |2^A| = 2^{|A|}.

    ∗ There are 2^r binary strings/sequences of length r.


• Sampling without replacement:

  (n)_r = ∏_{i=0}^{r−1} (n − i) = n! / (n − r)! = n · (n − 1) · · · (n − (r − 1))   [r terms];   r ≤ n.

  ◦ Ordered sampling of r ≤ n out of n items without replacement.

  ◦ For integers r, n such that r > n, we have (n)_r = 0.

  ◦ The definition in product form

    (n)_r = ∏_{i=0}^{r−1} (n − i) = n · (n − 1) · · · (n − (r − 1))   [r terms]

    can be extended to any real number n and a non-negative integer r. We define (n)_0 = 1. (This makes sense because we usually take the empty product to be 1.)

  ◦ (n)_1 = n

  ◦ (n)_r = (n − (r − 1)) (n)_{r−1}. For example, (7)_5 = (7 − 4)(7)_4.

  ◦ (1)_r = 1 if r = 1, and 0 if r > 1.

• Ratio:

  (n)_r / n^r = ∏_{i=0}^{r−1} (n − i) / ∏_{i=0}^{r−1} n = ∏_{i=0}^{r−1} (1 − i/n) ≈ ∏_{i=0}^{r−1} e^{−i/n} = e^{−(1/n) Σ_{i=0}^{r−1} i} = e^{−r(r−1)/(2n)} ≈ e^{−r²/(2n)}

1.13. Factorial and Permutation: The number of arrangements (permutations) of n ≥ 0 distinct items is (n)_n = n!.

• 0! = 1! = 1

• n! = n (n − 1)!

• n! = ∫_0^∞ e^{−t} t^n dt

• Stirling's Formula:

  n! ≈ √(2πn) n^n e^{−n} = √(2πe) e^{(n + 1/2) ln(n/e)}.


1.14. Binomial coefficient:

C(n, r) = (n)_r / r! = n! / ((n − r)! r!)

This gives the number of unordered sets of size r drawn from an alphabet of size n without replacement; this is unordered sampling of r ≤ n out of n items without replacement. It is also the number of subsets of size r that can be formed from a set of n elements. Some properties are listed below:

(a) Use nchoosek(n,r) in MATLAB.

(b) Use combin(n,r) in Mathcad. However, to do symbolic manipulation, use the factorial definition directly.

(c) Reflection property: C(n, r) = C(n, n − r).

(d) C(n, n) = C(n, 0) = 1.

(e) C(n, 1) = C(n, n − 1) = n.

(f) C(n, r) = 0 if n < r or r is a negative integer.

(g) max_r C(n, r) = C(n, ⌊(n + 1)/2⌋).

(h) Pascal's "triangle" rule: C(n, k) = C(n − 1, k) + C(n − 1, k − 1). This property divides the process of choosing k items into two steps, where the first step is to decide whether or not to choose the first item.

Figure 3: Pascal Triangle.

(i) Σ_{0≤k≤n, k even} C(n, k) = Σ_{0≤k≤n, k odd} C(n, k) = 2^{n−1}.

There are many ways to show this identity.


  (i) Consider the number of subsets of a set S = {a_1, a_2, . . . , a_n} of n distinct elements. First choose a subset A of the first n − 1 elements of S; there are 2^{n−1} distinct choices of A. Then, for each A, to get a set with an even number of elements, add the element a_n to A if and only if |A| is odd.

  (ii) Look at the binomial expansion of (x + y)^n with x = 1 and y = −1.

  (iii) For odd n, use the fact that C(n, r) = C(n, n − r).

(j) Σ_{k=0}^{min(n_1,n_2)} C(n_1, k) C(n_2, k) = C(n_1 + n_2, n_1) = C(n_1 + n_2, n_2).

• This identity counts the ways of choosing n_2 items from n_1 + n_2 items in two steps: first choose k items from the first n_1 items, then choose the remaining n_2 − k items from the last n_2 items, and use C(n_2, n_2 − k) = C(n_2, k).

• We can replace the min(n_1, n_2) in the first sum by n_1 or n_2 if we define C(n, k) = 0 for k > n.

• Σ_{r=0}^{n} C(n, r)² = C(2n, n).

(k) Parallel summation:

Σ_{m=k}^{n} C(m, k) = C(k, k) + C(k + 1, k) + · · · + C(n, k) = C(n + 1, k + 1).

To see this, suppose we try to choose k + 1 items from the n + 1 items a_1, a_2, . . . , a_{n+1}. First, we decide whether to choose a_1. If so, then we need to choose the remaining k items from the n items a_2, . . . , a_{n+1}; this gives the C(n, k) term. Now, suppose we did not choose a_1. Then we still need to choose k + 1 items from the n items a_2, . . . , a_{n+1}, and we repeat the same argument on a_2 instead of a_1. Equivalently,

Σ_{k=0}^{r} C(n + k, k) = Σ_{k=0}^{r} C(n + k, n) = C(n + r + 1, n + 1) = C(n + r + 1, r).   (1)

To prove the closed form in (1), use induction on r.
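The identities in (h), (j), and (k) are easy to sanity-check numerically; the following Python sketch does so for one arbitrary n and k.

    from math import comb

    n = 8
    for k in range(1, n):
        assert comb(n, k) == comb(n - 1, k) + comb(n - 1, k - 1)          # Pascal's rule (h)
    assert sum(comb(n, r) ** 2 for r in range(n + 1)) == comb(2 * n, n)   # sum of squares (j)
    k = 3
    assert sum(comb(m, k) for m in range(k, n + 1)) == comb(n + 1, k + 1) # parallel summation (k)
    print("identities verified for n =", n)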

1.15. Binomial theorem:

(x + y)^n = Σ_{r=0}^{n} C(n, r) x^r y^{n−r}

(a) Let x = y = 1; then Σ_{r=0}^{n} C(n, r) = 2^n.

(b) Sum involving only the even terms (or only the odd terms):

Σ_{r even, 0≤r≤n} C(n, r) x^r y^{n−r} = (1/2)((x + y)^n + (y − x)^n),   and

Σ_{r odd, 0≤r≤n} C(n, r) x^r y^{n−r} = (1/2)((x + y)^n − (y − x)^n).

In particular, if x + y = 1, then

Σ_{r even, 0≤r≤n} C(n, r) x^r y^{n−r} = (1/2)(1 + (1 − 2x)^n),   and   (2a)

Σ_{r odd, 0≤r≤n} C(n, r) x^r y^{n−r} = (1/2)(1 − (1 − 2x)^n).   (2b)

(c) Approximation by the entropy function:

H(p) = −p log_b(p) − (1 − p) log_b(1 − p)

• Binary case: b = 2 ⇒

  H_2(p) = −p log_2(p) − (1 − p) log_2(1 − p).

  In this case,

  (1/(n + 1)) 2^{n H_2(r/n)} ≤ C(n, r) ≤ 2^{n H_2(r/n)}.

  Hence, C(n, r) ≈ 2^{n H_2(r/n)}.

(d) By repeatedly differentiating with respect to x and then multiplying by x, we have

• Σ_{r=0}^{n} r C(n, r) x^r y^{n−r} = nx(x + y)^{n−1}, and

• Σ_{r=0}^{n} r² C(n, r) x^r y^{n−r} = nx(x(n − 1)(x + y)^{n−2} + (x + y)^{n−1}).

For x + y = 1, we have

• Σ_{r=0}^{n} r C(n, r) x^r (1 − x)^{n−r} = nx, and

• Σ_{r=0}^{n} r² C(n, r) x^r (1 − x)^{n−r} = nx(nx + 1 − x).

All identities above can be verified easily via Mathcad.

1.16. Multinomial Counting: The multinomial coefficient C(n; n_1, n_2, . . . , n_r) is defined as

∏_{i=1}^{r} C(n − Σ_{k<i} n_k, n_i) = C(n, n_1) · C(n − n_1, n_2) · C(n − n_1 − n_2, n_3) · · · C(n_r, n_r) = n! / ∏_{i=1}^{r} n_i!

It is the number of ways that we can arrange n = Σ_{i=1}^{r} n_i tokens when having r types of symbols and n_i indistinguishable copies/tokens of a type-i symbol.
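The two expressions agree, as the following Python sketch shows for an arbitrary example; the counts (2, 3, 1) correspond to arrangements of the word AABBBC.

    from math import comb, factorial

    def multinomial(ns):
        # product of binomial coefficients: C(n, n1) * C(n - n1, n2) * ...
        total, remaining = 1, sum(ns)
        for ni in ns:
            total *= comb(remaining, ni)
            remaining -= ni
        return total

    ns = (2, 3, 1)
    denom = 1
    for ni in ns:
        denom *= factorial(ni)
    assert multinomial(ns) == factorial(sum(ns)) // denom   # n! / (n1! n2! ... nr!)
    print(multinomial(ns))   # 60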


[Figure 4 in the original reproduces a table of binomial-coefficient identities from [18] (factorial expansion, symmetry, monotonicity, Pascal's identity, the binomial theorem, counting all subsets, even and odd subsets, sum of squares, absorption/extraction, trinomial revision, parallel and diagonal summation, Vandermonde convolution, and several related summation identities), together with short committee-selection proofs of the sum-of-squares and absorption/extraction identities.]

Figure 4: Binomial coefficient identities [18]


1.17. Multinomial Theorem:

(x_1 + · · · + x_r)^n = Σ_{i_1=0}^{n} Σ_{i_2=0}^{n−i_1} · · · Σ_{i_{r−1}=0}^{n − Σ_{j<r−1} i_j} [ n! / ((n − Σ_{k<r} i_k)! ∏_{k<r} i_k!) ] x_r^{n − Σ_{j<r} i_j} ∏_{k=1}^{r−1} x_k^{i_k}

• r-ary entropy function: Consider any vector p = (p_1, p_2, . . . , p_r) such that p_i ≥ 0 and Σ_{i=1}^{r} p_i = 1. We define

  H(p) = −Σ_{i=1}^{r} p_i log_b p_i.

  As a special case, let p_i = n_i/n; then

  C(n; n_1, n_2, . . . , n_r) = n! / ∏_{i=1}^{r} n_i! ≈ 2^{n H_2(p)}.

1.18. The number of solutions to x_1 + x_2 + · · · + x_n = k where the x_i's are nonnegative integers is C(k + n − 1, k) = C(k + n − 1, n − 1).

(a) Suppose we further require that the x_i are strictly positive (x_i ≥ 1); then there are C(k − 1, n − 1) solutions.

(b) Extra lower-bound requirement: Suppose we further require that x_i ≥ a_i, where the a_i are some given nonnegative integers; then the number of solutions is C(k − (a_1 + a_2 + · · · + a_n) + n − 1, n − 1). Note that here we work with the equivalent problem y_1 + y_2 + · · · + y_n = k − Σ_{i=1}^{n} a_i, where y_i ≥ 0.

(c) Extra upper-bound requirement: Suppose we further require that 0 ≤ x_i < b_i. Let A_i be the set of solutions such that x_i ≥ b_i and x_j ≥ 0 for j ≠ i. The number of solutions is C(k + n − 1, n − 1) − |⋃_{i=1}^{n} A_i|, where the second term can be found via the inclusion/exclusion principle

|⋃_{i∈[n]} A_i| = Σ_{∅ ≠ I ⊂ [n]} (−1)^{|I|+1} |⋂_{i∈I} A_i|

and the fact that, for any index set I ⊂ [n], we have |⋂_{i∈I} A_i| = C(k − Σ_{i∈I} b_i + n − 1, n − 1).

(d) Extra range requirement: Suppose we further require that a_i ≤ x_i < b_i, where 0 ≤ a_i < b_i; then we work instead with y_i = x_i − a_i. The number of solutions is

C((k − Σ_{i=1}^{n} a_i) + n − 1, n − 1) + Σ_{∅ ≠ I ⊂ [n]} (−1)^{|I|} C((k − Σ_{i=1}^{n} a_i) − Σ_{i∈I} (b_i − a_i) + n − 1, n − 1).


1.19. The bars and stars argument:

• Consider the distribution of r = 10 indistinguishable balls into n = 5 distinguishable cells. Then we are only concerned with the number of balls in each cell. Using n − 1 = 4 bars, we can divide the r = 10 stars into n = 5 groups. For example, ****|***||**|* would mean (4, 3, 0, 2, 1). In general, there are C(n + r − 1, r) ways of arranging the bars and stars.

• There are C(n + r − 1, r) distinct vectors x = x_1^n of nonnegative integers such that x_1 + x_2 + · · · + x_n = r. We use n − 1 bars to separate r 1's.

• Suppose r letters are drawn with replacement from a set {a_1, a_2, . . . , a_n}. Given a drawn sequence, let x_i be the number of a_i in the drawn sequence. Then there are C(n + r − 1, r) possible x = x_1^n.

[Figure 5 in the original reproduces a glossary and "Summary of Counting Problems" page from [18]: definitions of tree diagram, unordered selection (with and without replacement), and Young tableau, plus Table 1 of that handbook, which lists counting problems (arranging objects in a row or circle, choosing k of n objects with order mattering or not and with or without repetition, derangements, etc.), the count for each case, and the handbook section where each is treated.]

Figure 5: Counting problems and corresponding sections in [18].


[Figure 6 in the original continues Table 1 of [18]: counts of subsets (of a given size, of all sizes, without consecutive elements), placements of n objects into k cells (distinct or identical objects, distinct or identical cells, with or without empty cells, leading to Stirling and Bell numbers and integer partitions), and solutions of x_1 + · · · + x_n = k under various integrality and ordering constraints.]

Figure 6: Counting problems (con't) and corresponding sections in [18].


[Figure 7 in the original continues Table 1 of [18]: counts of functions between finite sets (all, one-to-one, onto, partial), bit strings of length n with various constraints, partitions of a positive integer and of a set, and lattice-path counts related to Catalan numbers.]

Figure 7: Counting problems (con't) and corresponding sections in [18].

[Figure 8 in the original shows the final rows of Table 1 of [18] (colorings of the corners, edges, and faces of a hexagon, tetrahedron, and cube up to symmetry; win/loss sequences in a playoff series; ballot-type sequences; well-formed parenthesizations; and triangulations of a convex polygon, many counted by Catalan numbers), followed by the notation key used in the table: Bell numbers, Stirling cycle and subset numbers, Catalan numbers, Fibonacci numbers, falling powers and k-permutations, partition-counting functions, Euler and Eulerian numbers, and tangent numbers.]

Figure 8: Notation from [18]


1.3 Dirac Delta Function

The (Dirac) delta function or (unit) impulse function is denoted by δ(t). It is usually depicted as a vertical arrow at the origin. Note that δ(t) is not a true function; it is undefined at t = 0. We define δ(t) as a generalized function which satisfies the sampling property (or sifting property)

∫ φ(t) δ(t) dt = φ(0)

for any function φ(t) which is continuous at t = 0. From this definition, it follows that

(δ ∗ φ)(t) = (φ ∗ δ)(t) = ∫ φ(τ) δ(t − τ) dτ = φ(t),

where we assume that φ is continuous at t. Intuitively, we may visualize δ(t) as an infinitely tall, infinitely narrow rectangular pulse of unit area: lim_{ε→0} (1/ε) 1[|t| ≤ ε/2].

We list some interesting properties of δ(t) here.

• δ(t) = 0 when t ≠ 0. Similarly, δ(t − T) = 0 for t ≠ T.

• ∫_A δ(t) dt = 1_A(0). In particular,

  (a) ∫ δ(t) dt = 1.

  (b) ∫_{{0}} δ(t) dt = 1.

  (c) ∫_{−∞}^{x} δ(t) dt = 1_{[0,∞)}(x). Hence, we may think of δ(t) as the "derivative" of the unit step function U(x) = 1_{[0,∞)}(x).

• ∫ φ(t) δ(t) dt = φ(0) for φ continuous at 0.

• ∫ φ(t) δ(t − T) dt = φ(T) for φ continuous at T. In fact, for any ε > 0, ∫_{T−ε}^{T+ε} φ(t) δ(t − T) dt = φ(T).

• δ(at) = (1/|a|) δ(t).

• δ(t − t_1) ∗ δ(t − t_2) = δ(t − (t_1 + t_2)).

• g(t) ∗ δ(t − t_0) = g(t − t_0).

• Fourier properties:

  ◦ Fourier series: δ(x − a) = 1/(2π) + (1/π) Σ_{n=1}^{∞} cos(n(x − a)) on [−π, π].

  ◦ Fourier transform: δ(t) = ∫ 1 · e^{j2πft} df.


• For a function g whose real-valued roots are t_i,

  δ(g(t)) = Σ_i δ(t − t_i) / |g′(t_i)|   (3)

  [3, p 387]. Hence,

  ∫ f(t) δ(g(t)) dt = Σ_{x : g(x)=0} f(x) / |g′(x)|.   (4)

Note that the (Dirac) delta function is to be distinguished from the discrete-time Kronecker delta function.

As a finite measure, δ is a unit mass at 0; that is, for any set A, we have δ(A) = 1[0 ∈ A]. In this case, we again have ∫ g dδ = ∫ g(x) δ(dx) = g(0) for any measurable g.

For a function g : D → R^n where D ⊂ R^n,

δ(g(x)) = Σ_{z : g(z)=0} δ(x − z) / |det dg(z)|   (5)

[3, p 387].

2 Classical Probability

Classical probability, which is based upon the ratio of the number of outcomes favorable to the occurrence of the event of interest to the total number of possible outcomes, provided most of the probability models used prior to the 20th century. Classical probability remains of importance today and provides the most accessible introduction to the more general theory of probability.

Given a finite sample space Ω, the classical probability of an event A is

P(A) = |A| / |Ω| = (the number of cases favorable to the event) / (the total number of possible cases).

• In this section, we are more apt to refer to equipossible cases as ones selected at random. Probabilities can be evaluated for events whose elements are chosen at random by enumerating the number of elements in the event.

• The bases for identifying equipossibility were often

  ◦ physical symmetry (e.g., a well-balanced die, made of homogeneous material in a cubical shape), or

  ◦ a balance of information or knowledge concerning the various possible outcomes.

• Equipossibility is meaningful only for a finite sample space, and, in this case, the evaluation of probability is accomplished through the definition of classical probability.

2.1. Basic properties of classical probability:


• P (A) ≥ 0

• P (Ω) = 1

• P (∅) = 0

• P (Ac) = 1− P (A)

• P (A ∪B) = P (A) + P (B)− P (A ∩B) which comes directly from

|A ∪B| = |A|+ |B| − |A ∩B|.

• A ⊥ B is equivalent to P (A ∩B) = 0.

• A ⊥ B ⇒ P (A ∪B) = P (A) + P (B)

• Suppose Ω = {ω_1, . . . , ω_n} and P({ω_i}) = 1/n. Then P(A) = Σ_{ω∈A} P({ω}).

  ◦ The probability of an event is equal to the sum of the probabilities of its component outcomes because outcomes are mutually exclusive.

2.2. Classical Conditional Probability: The conditional classical probability P(A|B) of event A, given that event B ≠ ∅ occurred, is given by

P(A|B) = |A ∩ B| / |B| = P(A ∩ B) / P(B).   (6)

• It is the updated probability of the event A given that we now know that B occurred.

• Read “conditional probability of A given B”.

• P (A|B) = P (A ∩B|B) ≥ 0

• For any A such that B ⊂ A, we have P (A|B) = 1. This implies

P (Ω|B) = P (B|B) = 1.

• If A ⊥ C, P (A ∪ C |B ) = P (A |B ) + P (C |B )

• P (A ∩B) = P (B)P (A|B)

• P (A ∩B) ≤ P (A|B)

• P (A ∩B ∩ C) = P (A)× P (B|A)× P (C|A ∩B)

• P (A ∩B) = P (A)× P (B|A)

• P (A ∩B ∩ C) = P (A ∩B)× P (C|A ∩B)

• P (A,B |C ) = P (A |C )P (B |A,C ) = P (B |C )P (A |B,C )


2.3. Total Probability and Bayes' Theorem: If {B_1, . . . , B_n} is a partition of Ω, then for any set A,

• Total Probability Theorem: P(A) = Σ_{i=1}^{n} P(A|B_i) P(B_i).

• Bayes' Theorem: Suppose P(A) > 0; then P(B_k|A) = P(A|B_k) P(B_k) / Σ_{i=1}^{n} P(A|B_i) P(B_i).

2.4. Independent Events: A and B are independent (A |= B) if and only if

P(A ∩ B) = P(A) P(B).   (7)

In classical probability, this is equivalent to

|A ∩ B| |Ω| = |A| |B|.

• Sometimes the definition of independence above does not agree with the everyday-language use of the word "independence". Hence, many authors use the term "statistical independence" for the definition above to distinguish it from other notions of independence.

2.5. Having three pairwise independent events does not imply that the three events are jointly independent. In other words,

A |= B, B |= C, A |= C does not imply that A, B, C are jointly independent.

Example: Consider the experiment of flipping a fair coin twice, with Ω = {HH, HT, TH, TT}. Define event A to be the event that the first flip gives an H, that is, A = {HH, HT}; event B to be the event that the second flip gives an H, that is, B = {HH, TH}; and C = {HH, TT}. These three events are pairwise independent but not jointly independent. Note also that even though the events A and B are not disjoint, they are independent.

2.6. Consider Ω of size 2n. We are given a set A ⊂ Ω of size n. Then P(A) = 1/2. We want to find all sets B ⊂ Ω such that A |= B. (Note that without the required independence, there are 2^{2n} possible B.) For independence, we need

P(A ∩ B) = P(A) P(B).

Let r = |A ∩ B|. Then r can be any integer from 0 to n. Also, let k = |B \ A|. Then the condition for independence becomes

r/(2n) = (1/2) · (r + k)/(2n),

which is equivalent to r = k. So the construction of the set B is given by choosing r elements from the set A, and then choosing k = r elements from the set Ω \ A. There are C(n, r) choices for the first part and C(n, k) = C(n, r) choices for the second part. Therefore, the total number of possible B such that A |= B is

Σ_{r=0}^{n} C(n, r)² = C(2n, n).


2.1 Examples

2.7. Background

(a) Historically, dice is the plural of die, but in modern standard English dice is used as both the singular and the plural. [Excerpted from the Compact Oxford English Dictionary.]

2.8. Chevalier de Méré's Scandal of Arithmetic:

Which is more likely: obtaining at least one six in 4 tosses of a fair die (event A), or obtaining at least one double six in 24 tosses of a pair of dice (event B)?

We have

P(A) = 1 − (5/6)^4 ≈ 0.518

and

P(B) = 1 − (35/36)^{24} ≈ 0.491.

Therefore, the first case is more probable.

Remark: Probability theory was originally inspired by gambling problems. In 1654, Chevalier de Méré invented a gambling system which bet even money on the second case above. However, when he began losing money, he asked his mathematician friend Blaise Pascal to analyze his gambling system. Pascal discovered that the Chevalier's system would lose about 51 percent of the time. Pascal became so interested in probability that, together with another famous mathematician, Pierre de Fermat, he laid the foundation of probability theory.
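The two probabilities are easy to reproduce, and a quick Monte Carlo run confirms them; a minimal Python sketch (the trial count is an arbitrary choice):

    import random

    p_A = 1 - (5 / 6) ** 4       # at least one six in 4 tosses of a die
    p_B = 1 - (35 / 36) ** 24    # at least one double six in 24 tosses of a pair
    print(f"P(A) = {p_A:.3f}, P(B) = {p_B:.3f}")

    trials = 100_000
    hits_A = sum(any(random.randint(1, 6) == 6 for _ in range(4))
                 for _ in range(trials))
    hits_B = sum(any(random.randint(1, 6) == 6 and random.randint(1, 6) == 6
                     for _ in range(24))
                 for _ in range(trials))
    print(f"simulated: P(A) ~ {hits_A / trials:.3f}, P(B) ~ {hits_B / trials:.3f}")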

2.9. A random sample of size r with replacement is taken from a population of n elements. The probability of the event that in the sample no element appears twice (that is, no repetition in our sample) is (n)_r / n^r. The probability that at least one element appears twice is

p_u(n, r) = 1 − ∏_{i=1}^{r−1} (1 − i/n) ≈ 1 − e^{−r(r−1)/(2n)}.

In fact, when r − 1 < n/2, (A.3) gives

e^{−(r(r−1)/(2n)) · ((3n + 2r − 1)/(3n))} ≤ ∏_{i=1}^{r−1} (1 − i/n) ≤ e^{−r(r−1)/(2n)}.

• From the approximation, to have p_u(n, r) = p, we need

  r ≈ 1/2 + (1/2) √(1 − 8n ln(1 − p)).


• Probability of coincident birthdays: the probability that at least two people in a group of r people share a birthday is

  = 1, if r ≥ 365, and

  = 1 − (365/365) · (364/365) · · · ((365 − (r − 1))/365)   [r terms], if 0 ≤ r ≤ 365.

  ◦ Birthday paradox: In a group of 23 randomly selected people, the probability that at least two will share a birthday (assuming birthdays are equally likely to occur on any given day of the year) is about 0.5.

[Figure 9 in the original plots p_u(n, r) for n = 365 as a function of r, together with the approximation 1 − e^{−r(r−1)/(2n)}; the curve crosses 0.5 near r = 23 (the birthday paradox). A second panel plots, for several target probabilities p, the group size needed to reach p_u(n, r) = p as n varies. The accompanying text restates the problem in terms of i.i.d. random variables X_1, . . . , X_r uniform on a finite set of n values: the probability that they are all distinct is ∏_{i=1}^{r−1} (1 − i/n) ≈ e^{−r(r−1)/(2n)} [Szekely'86, p 14].]

Figure 9: p_u(n, r)

2.10. Monty Hall's Game: The game starts by showing a contestant 3 closed doors, behind one of which is a prize. The contestant selects a door, but before the door is opened, Monty Hall, who knows which door hides the prize, opens one of the remaining doors (one that does not hide the prize). The contestant is then allowed either to stay with his original guess or to change to the other closed door. Question: is it better to stay or to switch? Answer: switch, because given that the contestant switches, the probability that he wins the prize is 2/3.
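A short Monte Carlo sketch in Python of the always-switch strategy (the number of trials is an arbitrary choice):

    import random

    def play_switch():
        prize = random.randrange(3)
        pick = random.randrange(3)
        # the host opens a door that is neither the contestant's pick nor the prize
        opened = next(d for d in range(3) if d != pick and d != prize)
        switched = next(d for d in range(3) if d != pick and d != opened)
        return switched == prize

    trials = 100_000
    wins = sum(play_switch() for _ in range(trials))
    print(f"P(win | switch) ~ {wins / trials:.3f}   (theory: 2/3)")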

2.11. False Positives on Diagnostic Tests: Let D be the event that the testee has the disease. Let + be the event that the test returns a positive result. Denote the probability of having the disease by p_D.

Now, assume that the test always returns a positive result when the testee has the disease. This is equivalent to P(+|D) = 1 and P(+^c|D) = 0. Also, suppose that even when the testee does not have the disease, the test will still return a positive result with probability p_+; that is, P(+|D^c) = p_+.

If the test returns a positive result, then the probability that the testee has the disease is

P(D|+) = P(D ∩ +) / P(+) = p_D / (p_D + p_+(1 − p_D)) = 1 / (1 + (p_+/p_D)(1 − p_D)) ≈ p_D / p_+   for a rare disease (p_D ≪ 1).


            | D                            | D^c                              | (row sum)
  +         | P(+ ∩ D) = P(+|D) P(D)       | P(+ ∩ D^c) = P(+|D^c) P(D^c)     | P(+) = P(+ ∩ D) + P(+ ∩ D^c)
            |          = 1 · p_D = p_D     |            = p_+ (1 − p_D)       |      = p_D + p_+ (1 − p_D)
  +^c       | P(+^c ∩ D) = P(+^c|D) P(D)   | P(+^c ∩ D^c) = P(+^c|D^c) P(D^c) | P(+^c) = P(+^c ∩ D) + P(+^c ∩ D^c)
            |           = 0 · p_D = 0      |              = (1 − p_+)(1 − p_D)|        = (1 − p_+)(1 − p_D)
  (col sum) | P(D) = P(+ ∩ D) + P(+^c ∩ D) | P(D^c) = P(+ ∩ D^c) + P(+^c ∩ D^c)| P(Ω) = 1
            |      = p_D                   |        = 1 − p_D                 |
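A minimal numeric illustration in Python, with assumed example values p_D = 0.001 and p_+ = 0.05 (both are arbitrary choices, not values taken from the text):

    p_D, p_plus = 0.001, 0.05            # assumed prevalence and false-positive rate

    p_pos = p_D + p_plus * (1 - p_D)     # total probability of a positive test
    p_D_given_pos = p_D / p_pos          # Bayes' theorem
    print(f"P(+)     = {p_pos:.4f}")
    print(f"P(D | +) = {p_D_given_pos:.4f}   (rare-disease approximation p_D/p_+ = {p_D / p_plus:.4f})")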

3 Probability Foundations

To study the formal definition of probability, we start with the probability space (Ω, A, P). Let Ω be an arbitrary space or set of points ω. Viewed probabilistically, a subset of Ω is an event and an element ω of Ω is a sample point. Each event is a collection of outcomes which are elements of the sample space Ω.

The theory of probability focuses on collections of events, called event algebras and typically denoted A (or F), that contain all the events of interest (regarding the random experiment E) to us, and are such that we have knowledge of their likelihood of occurrence. The probability P itself is defined as a number in the range [0, 1] associated with each event in A.

3.1 Algebra and σ-algebra

The class 2^Ω of all subsets can be too large¹ for us to define probability measures with consistency across all members of the class. In this section, we present smaller classes which have several "nice" properties.

Definition 3.1. [7, Def 1.6.1 p38] An event algebra A is a collection of subsets of thesample space Ω such that it is

(1) nonempty (this is equivalent to Ω ∈ A);

(2) closed under complementation (if A ∈ A then Ac ∈ A);

(3) and closed under finite unions (if A, B ∈ A then A ∪ B ∈ A).

In other words, “A class is called an algebra on Ω if it contains Ω itself and is closed underthe formation of complements and finite unions.”

3.2. Examples of algebras

• Ω = any fixed interval of R. A = finite unions of intervals contained in Ω

• Ω = (0, 1]. B0 = the collection of all finite unions of intervals of the form (a, b] ⊂ (0, 1]

¹There is no problem when Ω is countable.


This B_0 is not a σ-field: consider the countable union ⋃_{i=1}^{∞} (1/(2i+1), 1/(2i)], which is not a finite union of intervals of the form (a, b].

3.3. Properties of an algebra A:

(a) Nonempty: ∅ ∈ A, Ω ∈ A

(b) A ⊂ 2^Ω

(c) An algebra is closed under finite set-theoretic operations.

• A ∈ A ⇒ A^c ∈ A

• A, B ∈ A ⇒ A ∪ B ∈ A, A ∩ B ∈ A, A \ B ∈ A, A∆B = (A \ B) ∪ (B \ A) ∈ A

• A_1, A_2, . . . , A_n ∈ A ⇒ ⋃_{i=1}^{n} A_i ∈ A and ⋂_{i=1}^{n} A_i ∈ A

(d) The collection of algebras in Ω is closed under arbitrary intersection. In particular, let A_1, A_2 be algebras of Ω and let A = A_1 ∩ A_2 be the collection of sets common to both algebras. Then A is an algebra.

(e) The smallest algebra is {∅, Ω}.

(f) The largest algebra is the set of all subsets of Ω, known as the power set and denoted by 2^Ω.

(g) Cardinality of algebras: An algebra of subsets of a finite set of n elements will always have a cardinality of the form 2^k, k ≤ n.

3.4. There is a smallest (in the sense of inclusion) algebra containing any given family C of subsets of Ω. For C ⊂ 2^Ω, the algebra generated by C is

⋂_{G is an algebra, C ⊂ G} G,

i.e., the intersection of all algebras containing C. It is the smallest algebra containing C.

Definition 3.5. A σ-algebra A is an event algebra that is also closed under countable unions:

(∀i ∈ N) A_i ∈ A ⟹ ⋃_{i∈N} A_i ∈ A.

Remarks:

• A σ-algebra is also an algebra.

• A finite algebra is also a σ-algebra.

3.6. Because every σ-algebra is also an algebra, it has all the properties listed in (3.3). Extra properties of a σ-algebra A are as follows:

(a) A_1, A_2, . . . ∈ A ⇒ ⋃_{j=1}^{∞} A_j ∈ A and ⋂_{j=1}^{∞} A_j ∈ A

(b) A σ-field is closed under countable set-theoretic operations.

(c) The collection of σ-algebras in Ω is closed under arbitrary intersection: let A_α be a σ-algebra for every α in some index set (potentially uncountable); then ⋂_α A_α is still a σ-algebra.

(d) An infinite σ-algebra F on X is uncountable; i.e., a σ-algebra is either finite or uncountable.

(e) If A_1 and A_2 are σ-algebras in X, it is not necessarily the case that A_1 ∪ A_2 is also a σ-algebra. For example, let E_1, E_2 ⊂ X be distinct sets that are not complements of one another, and let A_i = {∅, E_i, E_i^c, X}. Then E_i ∈ A_1 ∪ A_2 but, in general, E_1 ∪ E_2 ∉ A_1 ∪ A_2.

Definition 3.7. Generation of σ-algebra: Let C ⊂ 2^Ω. The σ-algebra generated by C, denoted σ(C), is

⋂_{G is a σ-field, C ⊂ G} G,

i.e., the intersection of all σ-fields containing C. It is the smallest σ-field containing C.

• If the set Ω is not implicit, we will explicitly write σ_Ω(C).

• We will say that a set A can be generated by elements of C if A ∈ σ(C).

3.8. Properties of σ(C):

(a) σ(C) is a σ-field.

(b) σ(C) is the smallest σ-field containing C, in the sense that if H is a σ-field and C ⊂ H, then σ(C) ⊂ H.

(c) C ⊂ σ(C)

(d) σ(σ(C)) = σ(C)

(e) If H is a σ-field and C ⊂ H, then σ(C) ⊂ H.

(f) σ(∅) = σ({Ω}) = {∅, Ω}

(g) σ({A}) = {∅, A, A^c, Ω} for A ⊂ Ω.

(h) σ({A, B}) has at most 16 elements. They are ∅, A, B, A ∩ B, A ∪ B, A \ B, B \ A, A∆B, and their corresponding complements. See also (3.11). Some of these 16 sets can be the same, hence the "at most" in the statement.

In the following items, A and B denote collections of subsets of Ω:

(i) A ⊂ B ⇒ σ(A) ⊂ σ(B)

(j) σ(A), σ(B) ⊂ σ(A) ∪ σ(B) ⊂ σ(A ∪ B)

(k) σ(A) = σ(A ∪ {∅}) = σ(A ∪ {Ω}) = σ(A ∪ {∅, Ω})

3.9. For the decomposition described in (1.8), let the starting collection be C_1 and the decomposed collection be C_2.

• If C_1 is finite, then σ(C_2) = σ(C_1).

• If C_1 is countable, then σ(C_2) ⊂ σ(C_1).

3.10. Construction of σ-algebra from countable partition: An intermediate-sized σ-algebra (a σ-algebra smaller than 2^Ω) can be constructed by first partitioning Ω into subsets and then forming the power set of these subsets, with an individual subset now playing the role of an individual outcome ω.

Given a countable² partition Π = {A_i : i ∈ I} of Ω, we can form a σ-algebra A by including all unions of subcollections:

A = {⋃_{i∈S} A_i : S ⊂ I},   (8)

[7, Ex 1.5 p. 39], where we define ⋃_{i∈∅} A_i = ∅. It turns out that A = σ(Π). Of course, a σ-algebra is also an algebra. Hence, (8) is also a way to construct an algebra. Note that, from (1.9), the necessary and sufficient condition for distinct S to produce distinct elements in (8) is that none of the A_i's are empty.

In particular, consider countable Ω = {x_i : i ∈ N}, where the x_i's are distinct. If we want a σ-algebra which contains {x_i} for every i, then the smallest one is 2^Ω, which happens to be the biggest one. So 2^Ω is the only σ-algebra which is "reasonable" to use.

3.11. Generation of σ-algebra from finite partition: If a finite collection Π = {A_i, i ∈ I} of non-empty sets forms a partition of Ω, then the algebra generated by Π is the same as the σ-algebra generated by Π, and it is given by (8). Moreover, |σ(Π)| = 2^{|I|} = 2^{|Π|}; that is, distinct sets S in (8) produce distinct members of the (σ-)algebra.

Therefore, given a finite collection of sets C = {C_1, C_2, . . . , C_n}, to find the algebra or σ-algebra generated by C, the first step is to use the C_i's to create a partition of Ω. Using (1.8), the partition is given by

Π = {⋂_{i=1}^{n} B_i : B_i = C_i or C_i^c}.   (9)

By (3.9), we know that σ(Π) = σ(C). Note that there are seemingly 2^n sets in Π; however, some of them can be ∅. We can eliminate the empty set(s) from Π and it is still a partition. So the cardinality of Π in (9) (after empty-set elimination) is k, where k is at most 2^n. The partition Π is then a collection of k sets, which can be renamed A_1, . . . , A_k, all of which are non-empty. Applying the construction in (8) (with I = [k]), we then have σ(C), whose cardinality is 2^k ≤ 2^{2^n}. See also properties (3.8.g) and (3.8.h).

²In this case, Π is countable if and only if I is countable.


Definition 3.12. In general, the Borel σ-algebra or Borel algebra B is the σ-algebra generated by the open subsets of Ω.

• Call B_Ω the σ-algebra of Borel subsets of Ω or the σ-algebra of Borel sets on Ω.

• Call a set B ∈ B_Ω a Borel set of Ω.

(a) On Ω = R, the σ-algebra generated by any of the following classes is the Borel σ-algebra:

(i) Open sets

(ii) Closed sets

(iii) Intervals

(iv) Open intervals

(v) Closed intervals

(vi) Intervals of the form (−∞, a], where a ∈ R
i. Can replace R by Q
ii. Can replace (−∞, a] by (−∞, a), [a, +∞), or (a, +∞)
iii. Can replace (−∞, a] by a combination of those in ii.

(b) For Ω ⊂ R, B_Ω = {A ∈ B_R : A ⊂ Ω} = B_R ∩ Ω, where B_R ∩ Ω = {A ∩ Ω : A ∈ B_R}.

(c) The Borel σ-algebra on the extended real line is the extended σ-algebra

B_R̄ = {A ∪ B : A ∈ B_R, B ∈ {∅, {−∞}, {∞}, {−∞, ∞}}}.

It is generated by, for example,

(i) A ∪ {{−∞}, {∞}}, where σ(A) = B_R
(ii) [a, b), [a, b], (a, b]
(iii) [−∞, b], [−∞, b), (a, +∞], [a, +∞]
(iv) [−∞, b̄], [−∞, b̄), (ā, +∞], [ā, +∞]

Here, a, b ∈ R and ā, b̄ ∈ R ∪ {±∞}.

(d) The Borel σ-algebra on R^k is generated by

(i) the class of open sets in R^k

(ii) the class of closed sets in R^k

(iii) the class of bounded semi-open rectangles (cells) I of the form

I = {x ∈ R^k : a_i < x_i ≤ b_i, i = 1, . . . , k}.

Note that I = ⊗_{i=1}^k (a_i, b_i], where ⊗ denotes the Cartesian product ×.


(iv) the class of "southwest regions" S_x of points "southwest" of x ∈ R^k, i.e., S_x = {y ∈ R^k : y_i ≤ x_i, i = 1, . . . , k}.

3.13. The Borel σ-algebra B of subsets of the reals is the usual algebra when we deal with real- or vector-valued quantities.

Our needs will not require us to delve into these issues beyond being assured that the events we discuss are constructed out of intervals and repeated set operations on these intervals, and that these constructions will not lead us out of B. In particular, countable unions of intervals, their complements, and much, much more are in B.

3.2 Kolmogorov’s Axioms for Probability

Definition 3.14. Kolmogorov’s Axioms for Probability [13]: A set function satisfyingK0–K4 is called a probability measure.

K0 Setup: The random experiment E is described by a probability space (Ω, A, P) consisting of an event σ-algebra A and a real-valued function P : A → R.

K1 Nonnegativity: ∀A ∈ A, P (A) ≥ 0.

K2 Unit normalization: P (Ω) = 1.

K3 Finite additivity: If A,B are disjoint, then P (A ∪B) = P (A) + P (B).

K4 Monotone continuity: If A_{i+1} ⊂ A_i for all i and ⋂_{i∈N} A_i = ∅ (a nested sequence of sets shrinking to the empty set), then

lim_{i→∞} P(A_i) = 0.

K4′ Countable or σ-additivity: If (A_i) is a countable collection of pairwise disjoint (no overlap) events, then

P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i).

• Note that there is never a problem with the convergence of the infinite sum; all partial sums of these non-negative summands are bounded above by 1.

• K4, which is neither a property of limits of sequences of relative frequencies nor meaningful in the finite sample space context of classical probability, is offered by Kolmogorov to ensure a degree of mathematical closure under limiting operations [7, p. 111].

• K4 is an idealization that is rejected by some accounts (usually subjectivist) of probability.

• If P satisfies K0–K3, then it satisfies K4 if and only if it satisfies K4′ [7, Theorem 3.5.1 p. 111].


Proof. To show "⇒", consider disjoint A_1, A_2, . . .. Define B_n = ⋃_{i>n} A_i. Then, by (1.6), B_{n+1} ⊂ B_n and ⋂_{n=1}^∞ B_n = ∅. So, by K4, lim_{n→∞} P(B_n) = 0. Furthermore,

⋃_{i=1}^∞ A_i = B_n ∪ (⋃_{i=1}^n A_i),

where all the sets on the RHS are disjoint. Hence, by finite additivity,

P(⋃_{i=1}^∞ A_i) = P(B_n) + ∑_{i=1}^n P(A_i).

Taking the limit as n → ∞ gives K4′.

To show "⇐", see (3.17).

Equivalently, instead of K0–K4, we can define a probability measure using P0–P2 below.

Definition 3.15. A probability measure defined on a σ-algebra A of Ω is a (set) function

(P0) P : A → [0, 1]

that satisfies:

(P1,K2) P (Ω) = 1

(P2,K4′) Countable additivity: For every countable sequence (A_n)_{n=1}^∞ of disjoint elements of A, one has P(⋃_{n=1}^∞ A_n) = ∑_{n=1}^∞ P(A_n).

• P (∅) = 0

• The number P (A) is called the probability of the event A

• The triple (Ω, A, P) is called a probability measure space, or simply a probability space.

• A support of P is any A-set A for which P (A) = 1

3.3 Properties of Probability Measure

3.16. Properties of probability measures:

(a) P (∅) = 0

(b) 0 ≤ P ≤ 1: For any A ∈ A, 0 ≤ P (A) ≤ 1

(c) If P(A) = 1, A is not necessarily Ω.

(d) Additivity: A,B ∈ A, A ∩B = ∅ ⇒ P (A ∪B) = P (A) + P (B)


(e) Monotonicity: A,B ∈ A, A ⊂ B ⇒ P (A) ≤ P (B) and P (B − A) = P (B)− P (A)

(f) P (Ac) = 1− P (A)

(g) P(A) + P(B) = P(A ∪ B) + P(A ∩ B) = P(A − B) + 2P(A ∩ B) + P(B − A).
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

(h) P (A ∪B) ≥ max(P (A), P (B)) ≥ min(P (A), P (B)) ≥ P (A ∩B)

(i) Inclusion-exclusion principle:

P(⋃_{k=1}^n A_k) = ∑_i P(A_i) − ∑_{i<j} P(A_i ∩ A_j) + ∑_{i<j<k} P(A_i ∩ A_j ∩ A_k) − · · · + (−1)^{n+1} P(A_1 ∩ · · · ∩ A_n).

In a more compact form,

P(⋃_{i=1}^n A_i) = ∑_{∅≠I⊂{1,...,n}} (−1)^{|I|+1} P(⋂_{i∈I} A_i).

For example,

P(A_1 ∪ A_2 ∪ A_3) = P(A_1) + P(A_2) + P(A_3) − P(A_1 ∩ A_2) − P(A_1 ∩ A_3) − P(A_2 ∩ A_3) + P(A_1 ∩ A_2 ∩ A_3).

See also (8.2). Moreover, for any event B, we have

P(⋂_{k=1}^n A_k^c ∩ B) = P(B) + ∑_{∅≠I⊂[n]} (−1)^{|I|} P(⋂_{i∈I} A_i ∩ B).   (10)

(j) Finite additivity: If A = ⋃_{j=1}^n A_j with the A_j ∈ A disjoint, then P(A) = ∑_{j=1}^n P(A_j).

• If A and B are disjoint sets in A, then P(A ∪ B) = P(A) + P(B).

(k) Subadditivity or Boole's Inequality: If A_1, . . . , A_n are events, not necessarily disjoint, then P(⋃_{i=1}^n A_i) ≤ ∑_{i=1}^n P(A_i).

(l) σ-subadditivity: If A_1, A_2, . . . is a sequence of measurable sets, not necessarily disjoint, then P(⋃_{i=1}^∞ A_i) ≤ ∑_{i=1}^∞ P(A_i).

• This formula is known as the union bound in engineering.

• If A_1, A_2, . . . is a sequence of events, not necessarily disjoint, then ∀α ∈ (0, 1], P(⋃_{i=1}^∞ A_i) ≤ (∑_{i=1}^∞ P(A_i))^α.


3.17. Conditional continuity from above: If B_1 ⊃ B_2 ⊃ B_3 ⊃ · · · is a decreasing sequence of measurable sets, then P(⋂_{i=1}^∞ B_i) = lim_{j→∞} P(B_j). In a more compact notation, if B_i ↓ B, then P(B) = lim_{j→∞} P(B_j).

Proof. Let B = ⋂_{i=1}^∞ B_i and let A_k = B_k \ B_{k+1}, i.e., the "new part". We consider two partitions of B_1: (1) B_1 = B ∪ ⋃_{j=1}^∞ A_j and (2) B_1 = B_n ∪ ⋃_{j=1}^{n−1} A_j. Partition (1) implies P(B_1) − P(B) = ∑_{j=1}^∞ P(A_j). Partition (2) implies P(B_1) − P(B_n) = ∑_{j=1}^{n−1} P(A_j). We then have

lim_{n→∞} (P(B_1) − P(B_n)) = ∑_{j=1}^∞ P(A_j) = P(B_1) − P(B),

where the first equality follows from (2) and the second from (1). Hence lim_{n→∞} P(B_n) = P(B).

3.18. Let A be a σ-algebra. Suppose that P : A → [0, 1] satisfies (P1) and is finitely additive. Then, the following are equivalent:

• (P2): If A_n ∈ A are disjoint, then P(⋃_{n=1}^∞ A_n) = ∑_{n=1}^∞ P(A_n).

• (K4): If A_n ∈ A and A_n ↓ ∅, then P(A_n) ↓ 0.

• (Continuity from above): If A_n ∈ A and A_n ↓ A, then P(A_n) ↓ P(A).

• If A_n ∈ A and A_n ↑ Ω, then P(A_n) ↑ P(Ω) = 1.

• (Continuity from below): If A_n ∈ A and A_n ↑ A, then P(A_n) ↑ P(A).

Hence, a probability measure satisfies all five properties above. Continuity from above and continuity from below are collectively called sequential continuity properties.

In fact, for a probability measure P, let (A_n) be a sequence of events in A which converges to A (i.e., A_n → A). Then A ∈ A and lim_{n→∞} P(A_n) = P(A). Of course, both A_n ↑ A and A_n ↓ A imply A_n → A. Note also that

P(⋃_{n=1}^∞ A_n) = lim_{N→∞} P(⋃_{n=1}^N A_n), and P(⋂_{n=1}^∞ A_n) = lim_{N→∞} P(⋂_{n=1}^N A_n).

Alternative forms of the sequential continuity properties are

P(⋃_{n=1}^∞ A_n) = lim_{N→∞} P(A_N), if A_n ⊂ A_{n+1},

and

P(⋂_{n=1}^∞ A_n) = lim_{N→∞} P(A_N), if A_{n+1} ⊂ A_n.

3.19. Given a common event algebra A, probability measures P_1, . . . , P_m, and numbers λ_1, . . . , λ_m with λ_i ≥ 0 and ∑_{i=1}^m λ_i = 1, the convex combination P = ∑_{i=1}^m λ_i P_i of probability measures is a probability measure.

3.20. A cannot contain an uncountable, disjoint collection of sets of positive probability.

Definition 3.21. Discrete probability measure: P is a discrete probability measure if there exist finitely or countably many points ω_k and nonnegative masses m_k such that ∀A ∈ A,

P(A) = ∑_{k: ω_k ∈ A} m_k = ∑_k m_k I_A(ω_k).

If there is just one of these points, say ω_0, with mass m_0 = 1, then P is a unit mass at ω_0. In this case, ∀A ∈ A, P(A) = I_A(ω_0). Notation: P = δ_{ω_0}.

• Here, Ω can be uncountable.

3.4 Countable Ω

A sample space Ω is countable if it is either finite or countably infinite. It is countably infinite if it has as many elements as there are integers. In either case, the elements of Ω can be enumerated as, say, ω_1, ω_2, . . .. If the event algebra A contains each singleton set {ω_k} (from which it follows that A is the power set of Ω), then we specify probabilities satisfying the Kolmogorov axioms through a restriction to the collection S = {{ω_k}} of singleton events.

Definition 3.22. When Ω is countable, a probability mass function (pmf) is any function p : Ω → [0, 1] such that ∑_{ω∈Ω} p(ω) = 1.

When the elements of Ω are enumerated, it is common to abbreviate p(ω_i) = p_i.

3.23. Every pmf p defines a probability measure P and conversely. Their relationship is given by

p(ω) = P({ω}),   (11)

P(A) = ∑_{ω∈A} p(ω).   (12)

The convenience of a specification by pmf becomes clear when Ω is a finite set of, say, n elements. Specifying P requires specifying 2^n values, one for each event in A, and doing so in a manner that is consistent with the Kolmogorov axioms. However, specifying p requires only providing n values, one for each element of Ω, satisfying the simple constraints of nonnegativity and summation to 1. The probability measure P satisfying (12) automatically satisfies the Kolmogorov axioms.


3.5 Independence

Definition 3.24. Independence between events and collections of events.

(a) Two events A, B are called independent if P(A ∩ B) = P(A)P(B).

(i) An event with probability 0 or 1 is independent of any event (including itself). In particular, ∅ and Ω are independent of any event.

(ii) Two events A, B with positive probabilities are independent if and only if P(B | A) = P(B), which is equivalent to P(A | B) = P(A). When A and/or B has zero probability, A and B are automatically independent.

(iii) An event A is independent of itself if and only if P(A) is 0 or 1.

(iv) If A and B are independent, then the two classes σ({A}) = {∅, A, A^c, Ω} and σ({B}) = {∅, B, B^c, Ω} are independent.

(v) Suppose A and B are disjoint. A and B are independent if and only if P(A) = 0 or P(B) = 0.

(vi) Suppose A ⊂ B. Then A and B are independent if and only if P(A) = 0 or P(B) = 1.

(b) Independence for a finite collection A_1, . . . , A_n of sets:

≡ P(⋂_{j∈J} A_j) = ∏_{j∈J} P(A_j) for all J ⊂ [n] with |J| ≥ 2.

Note that the case |J| = 1 holds automatically. The case |J| = 0 can be regarded as the ∅ event case, which is also trivially true.

There are ∑_{j=2}^n C(n, j) = 2^n − 1 − n constraints.

Example: A_1, A_2, A_3 are independent if and only if

P(A_1 ∩ A_2 ∩ A_3) = P(A_1)P(A_2)P(A_3)
P(A_1 ∩ A_2) = P(A_1)P(A_2)
P(A_1 ∩ A_3) = P(A_1)P(A_3)
P(A_2 ∩ A_3) = P(A_2)P(A_3)

Remark: The first equality alone is not enough for independence; see the counterexamples below. In fact, it is possible for the first equation to hold while the last three fail, as shown in (3.26.b). It is also possible to construct events such that the last three equations hold (pairwise independence), but the first one does not, as demonstrated in (3.26.a).

≡ P (B1 ∩B2 ∩ · · · ∩Bn) = P (B1)P (B2) · · ·P (Bn) where Bi = Ai or Bi = Ω

(c) Independence for a collection {A_α : α ∈ I} of sets:

≡ ∀ finite J ⊂ I, P(⋂_{α∈J} A_α) = ∏_{α∈J} P(A_α)

≡ Every finite subcollection is independent.

(d) Independence for a finite collection A_1, . . . , A_n of classes:

≡ The finite collection of sets A_1, . . . , A_n is independent for every choice of A_i ∈ A_i.

≡ P(B_1 ∩ B_2 ∩ · · · ∩ B_n) = P(B_1)P(B_2) · · · P(B_n) where B_i ∈ A_i or B_i = Ω

≡ P(B_1 ∩ B_2 ∩ · · · ∩ B_n) = P(B_1)P(B_2) · · · P(B_n) where B_i ∈ A_i ∪ {Ω}

≡ ∀i ∀B_i ⊂ A_i, the classes B_1, . . . , B_n are independent.

≡ A_1 ∪ {Ω}, . . . , A_n ∪ {Ω} are independent.

≡ A_1 ∪ {∅}, . . . , A_n ∪ {∅} are independent.

(e) Independence for a collection {A_θ : θ ∈ Θ} of classes:

≡ Every collection {A_θ : θ ∈ Θ} of sets with A_θ ∈ A_θ is independent.

≡ Every finite subcollection of the classes is independent.

≡ ∀ finite Λ ⊂ Θ, P(⋂_{θ∈Λ} A_θ) = ∏_{θ∈Λ} P(A_θ)

• By definition, a subcollection of independent events is also independent.

• The class {∅, Ω} is independent of any class.

Definition 3.25. A collection of events {A_α} is called pairwise independent if for every pair of distinct events A_{α_1}, A_{α_2}, we have P(A_{α_1} ∩ A_{α_2}) = P(A_{α_1})P(A_{α_2}).

• If a collection of events {A_α : α ∈ I} is independent, then it is pairwise independent. The converse is false. See (a) in Example (3.26).

• For K ⊂ J, P(⋂_{α∈J} A_α) = ∏_{α∈J} P(A_α) does not imply P(⋂_{α∈K} A_α) = ∏_{α∈K} P(A_α).

Example 3.26.

(a) Let Ω = {1, 2, 3, 4}, A = 2^Ω, P({i}) = 1/4, A_1 = {1, 2}, A_2 = {1, 3}, A_3 = {2, 3}. Then P(A_i ∩ A_j) = P(A_i)P(A_j) for all i ≠ j, but P(A_1 ∩ A_2 ∩ A_3) ≠ P(A_1)P(A_2)P(A_3).

(b) Let Ω = {1, 2, 3, 4, 5, 6}, A = 2^Ω, P({i}) = 1/6, A_1 = {1, 2, 3, 4}, A_2 = A_3 = {4, 5, 6}. Then P(A_1 ∩ A_2 ∩ A_3) = P(A_1)P(A_2)P(A_3), but P(A_i ∩ A_j) ≠ P(A_i)P(A_j) for all i ≠ j.

(c) The paradox of "almost sure" events: Consider two random events with probabilities of 99% and 99.99%, respectively. One could say that the two probabilities are nearly the same and that both events are almost sure to occur. Nevertheless, the difference may become significant in certain cases. Consider, for instance, an independent event which may occur on any day of the year with probability p = 99%; then the probability P that it occurs every day of the year is less than 3%, while if p = 99.99%, then P ≈ 97%.


4 Random Element

4.1. A function X : (Ω, A) → (E, B_E) is said to be a random element of E if and only if X is measurable, which is equivalent to each of the following statements:

≡ X^{−1}(B_E) ⊂ A

≡ σ(X) ⊂ A

≡ (reduced form) ∃ C ⊂ B_E such that σ(C) = B_E and X^{−1}(C) ⊂ A

• When E ⊂ R, X is called a random variable.

• When E ⊂ Rd, then X is called a random vector.

Definition 4.2. X = Y almost surely (a.s.) if P [X = Y ] = 1.

• The a.s. equality is an equivalence relation.

4.3. Law of X or distribution of X: P^X = µ_X = P ∘ X^{−1} = L(X) : (E, E) → [0, 1],

µ_X(A) = P^X(A) = P(X^{−1}(A)) = P ∘ X^{−1}(A) = P({ω : X(ω) ∈ A}) = P(X ∈ A).

4.4. For X ∈ L^p, t^p P[|X| ≥ t] → 0 as t → ∞.

4.5. A Borel set S is called a support of X if PX (Sc) = 0 (or equivalently PX (S) = 1)

4.1 Random Variable

Definition 4.6. A real-valued function X(ω) defined for points ω in a sample space Ω is called a random variable.

• Random variables are important because they provide a compact way of referring to events via their numerical attributes.

• The abbreviation r.v. will be used for “real-valued random variables” [11, p. 1].

• Technically, a random variable must be measurable.

4.7. At a certain point in most probability courses, the sample space is rarely mentioned anymore and we work directly with random variables. The sample space often "disappears" but it is really there in the background.

4.8. For B ⊂ R, we use the shorthand

• [X ∈ B] = {ω ∈ Ω : X(ω) ∈ B} and

• P[X ∈ B] = P([X ∈ B]) = P({ω ∈ Ω : X(ω) ∈ B}).


• In particular, P[X < x] is a shorthand for P({ω ∈ Ω : X(ω) < x}).

4.9. If X and Y are random variables, we use the shorthand

• [X ∈ B, Y ∈ C] = {ω ∈ Ω : X(ω) ∈ B and Y(ω) ∈ C} = [X ∈ B] ∩ [Y ∈ C].

• P [X ∈ B, Y ∈ C] = P ([X ∈ B] ∩ [Y ∈ C]) .

4.10. Every random variable can be written as a sum of a discrete random variable and a continuous random variable.

4.11. A random variable can have at most countably many points x such that P[X = x] > 0.

4.12. Point mass probability measures / Dirac measures, usually written ε_α or δ_α, denote a point mass of size one at the point α. In this case,

• P^X({α}) = 1

• P^X({α}^c) = 0

• F_X(x) = 1_{[α,∞)}(x)

4.13. There exist distributions that are neither discrete nor continuous. For example, let µ(A) = (1/2)µ_1(A) + (1/2)µ_2(A) for µ_1 discrete and µ_2 coming from a density.

4.14. When X and Y take finitely many values, say x_1, . . . , x_m and y_1, . . . , y_n, respectively, we can arrange the probabilities p_{X,Y}(x_i, y_j) in the m × n matrix

[ p_{X,Y}(x_1, y_1)  p_{X,Y}(x_1, y_2)  · · ·  p_{X,Y}(x_1, y_n) ]
[ p_{X,Y}(x_2, y_1)  p_{X,Y}(x_2, y_2)  · · ·  p_{X,Y}(x_2, y_n) ]
[        ⋮                  ⋮            ⋱          ⋮           ]
[ p_{X,Y}(x_m, y_1)  p_{X,Y}(x_m, y_2)  · · ·  p_{X,Y}(x_m, y_n) ]

• The sum of the entries in the ith row is p_X(x_i), and the sum of the entries in the jth column is p_Y(y_j).

• The sum of all the entries in the matrix is one. (See the MATLAB check below.)
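A quick MATLAB check of these two bullets with a hypothetical 2 × 3 joint pmf matrix (the numbers below are arbitrary, chosen only to sum to one):

P = [0.10 0.20 0.10;                 % hypothetical joint pmf p_{X,Y}(x_i, y_j)
     0.25 0.05 0.30];                % rows index x_i, columns index y_j
pX = sum(P, 2)                       % row sums: the pmf of X
pY = sum(P, 1)                       % column sums: the pmf of Y
totalMass = sum(P(:))                % should equal 1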

4.2 Distribution Function

4.15. The (cumulative) distribution function (cdf) induced by a probability P on (R, B) is the function F(x) = P(−∞, x].

The (cumulative) distribution function (cdf) of the random variable X is the function F_X(x) = P^X(−∞, x] = P[X ≤ x].

• The distribution P^X can be obtained from the distribution function by setting P^X(−∞, x] = F_X(x); that is, F_X uniquely determines P^X.

• 0 ≤ FX ≤ 1


C1 FX is non-decreasing

C2 F_X is right continuous:

∀x, F_X(x+) ≡ lim_{y→x, y>x} F_X(y) ≡ lim_{y↓x} F_X(y) = F_X(x) = P[X ≤ x].

(Figure 10: Right-continuous function at a jump point.)

The function g(x) = P [X < x] is left-continuous in x.

C3 lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1.

• ∀x, F_X(x−) ≡ lim_{y→x, y<x} F_X(y) ≡ lim_{y↑x} F_X(y) = P^X(−∞, x) = P[X < x].

• P[X = x] = P^X({x}) = F(x) − F(x−) = the jump or saltus in F at x.

• For all x < y,

P((x, y]) = F(y) − F(x)
P([x, y]) = F(y) − F(x−)
P([x, y)) = F(y−) − F(x−)
P((x, y)) = F(y−) − F(x)
P({x}) = F(x) − F(x−)

• A function F is the distribution function of some probability measure on (R, B) if and only if one has (C1), (C2), and (C3).

• If F satisfies (C1), (C2), and (C3), then there exists a unique probability measure P on (R, B) that has P(a, b] = F(b) − F(a) for all a, b ∈ R.

• F_X is continuous if and only if P[X = x] = 0 for every x.

• FX is continuous if and only if PX is continuous.

• FX has at most countably many points of discontinuity.


Definition 4.16. It is traditional to write X ∼ F to indicate that “X has distribution F”[23, p. 25].

Definition 4.17. FX,A (x) = P ([X ≤ x] ∩ A)

4.18. Left-continuous inverse: g^{−1}(y) = inf{x ∈ R : g(x) ≥ y}, y ∈ (0, 1).

(Figure 11: Left-continuous inverse on (0, 1).)

• Trick : Just flip the graph along the line x = y, then make the graph left-continuous.

• If g is a cdf, then only consider y ∈ (0, 1). It is called the inverse CDF [7, Def 8.4.1,p. 238] or quantile function.

In [23, Def 2.16, p. 25], the inverse CDF is defined using strict inequality “>”rather than “≥”.

• See table 1 for examples.

Distribution     F(x)                         F^{−1}(u)
Exponential      1 − e^{−λx}                  −(1/λ) ln u
Extreme value    1 − e^{−e^{(x−a)/b}}         a + b ln ln(1/u)
Geometric        1 − (1 − p)^i                ⌈ln u / ln(1 − p)⌉
Logistic         1 − 1/(1 + e^{(x−µ)/b})      µ − b ln(1/u − 1)
Pareto           1 − x^{−a}                   u^{−1/a}
Weibull          1 − e^{−(x/a)^b}             a (ln(1/u))^{1/b}

Table 1: Left-continuous inverse


Definition 4.19. Let X be a random variable with distribution function F. Suppose that p ∈ (0, 1). A value of x such that F(x−) = P[X < x] ≤ p and F(x) = P[X ≤ x] ≥ p is called a quantile of order p for the distribution. Roughly speaking, a quantile of order p is a value where the cumulative distribution crosses p. Note that it is not unique: suppose F(x) = p on an interval [a, b]; then every x ∈ [a, b] is a quantile of order p.

A quantile of order 1/2 is called a median of the distribution. When there is only one median, it is frequently used as a measure of the center of the distribution. A quantile of order 1/4 is called a first quartile and a quantile of order 3/4 is called a third quartile. A median is a second quartile.

Assuming uniqueness, let q_1, q_2, and q_3 denote the first, second, and third quartiles of X. The interquartile range is defined to be q_3 − q_1, and is sometimes used as a measure of the spread of the distribution with respect to the median. The five parameters min X, q_1, q_2, q_3, max X are often referred to as the five-number summary. Graphically, the five numbers are often displayed as a boxplot.

4.20. If F is non-decreasing, right continuous, with lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1, then F is the CDF of some probability measure on (R, B).

In particular, let U ∼ U(0, 1) and X = F^{−1}(U); then F_X = F. Here, F^{−1} is the left-continuous inverse of F. Note that we have just explicitly defined a random variable X(ω) with distribution function F on Ω = (0, 1).

• For example, to generate X ∼ E(λ), set X = −(1/λ) ln(U). (See the MATLAB check below.)
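A quick MATLAB sanity check of this recipe (the value λ = 2 below is an arbitrary illustrative choice): the sample mean and variance of −(1/λ) ln U should be close to 1/λ and 1/λ², respectively.

lambda = 2;                          % illustrative rate
U = rand(1, 1e6);                    % i.i.d. Uniform(0,1) samples
X = -(1/lambda)*log(U);              % inverse transform; X ~ E(lambda)
[mean(X) 1/lambda]                   % sample mean vs. 1/lambda
[var(X) 1/lambda^2]                  % sample variance vs. 1/lambda^2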

4.21. Random Variable Generation

(a) Inverse-Transform Method: To generate a random variable X with CDF F, set X = F^{−1}(U) where U is uniform on (0, 1). See also 4.20.

(b) Acceptance-Rejection Method: To generate a random variable X with pdf f, first find an easy-to-generate pdf g and a constant C such that Cg ≥ f.

(1) Generate a random variable Z from g(z).

(2) Generate a uniform random variable U on (0, 1) independently of Z.

(3) If U ≤ f(Z)/(Cg(Z)), then return X = Z ("accept"). Otherwise, go back to step (1) ("reject").
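A minimal MATLAB sketch of the acceptance-rejection method. The target density f(x) = 6x(1 − x) on (0, 1) is an arbitrary illustrative choice; with g the Uniform(0, 1) density, C = 1.5 works since max f = 1.5.

f = @(x) 6*x.*(1-x);                 % illustrative target pdf on (0,1)
C = 1.5;                             % C*g >= f with g = Uniform(0,1) density
N = 1e5; X = zeros(1, N);
for m = 1:N
    accepted = false;
    while ~accepted
        Z = rand;                    % step (1): draw Z from g
        U = rand;                    % step (2): independent Uniform(0,1)
        if U <= f(Z)/C               % step (3): accept with prob f(Z)/(C*g(Z))
            X(m) = Z; accepted = true;
        end
    end
end
[mean(X) 0.5]                        % sample mean vs. the mean of f, which is 1/2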

4.3 Discrete random variable

Definition 4.22. A random variable X is said to be a discrete random variable if there exist countably many distinct real numbers x_k such that

∑_k P[X = x_k] = 1.

≡ ∃ a countable set {x_1, x_2, . . .} such that µ_X({x_1, x_2, . . .}) = 1

≡ X has a countable support {x_1, x_2, . . .}


⇒ X is completely determined by the values µ_X({x_1}), µ_X({x_2}), . . .

• pi = pX (xi) = P [X = xi]

Definition 4.23. When X is a discrete random variable taking distinct values x_k, we define its probability mass function (pmf) by

p_X(x_k) = P[X = x_k].

• We can use a stem plot to visualize p_X.

• If Ω is countable, then there can be only countably many values of X(ω). So, any random variable defined on a countable Ω is discrete.

• Sometimes, we write p(x_k) or p_{x_k} instead of p_X(x_k).

• P[X ∈ B] = ∑_{x_k ∈ B} P[X = x_k].

• F_X(x) = ∑_{x_k} p_X(x_k) U(x − x_k).

Definition 4.24 (Discrete CDF). A cdf which can be written in the form F_d(x) = ∑_k p_k U(x − x_k) is called a discrete cdf [7, Def. 5.4.1, p. 163]. Here, U is the unit step function, {x_k} is an arbitrary countable set of real numbers, and {p_k} is a countable set of positive numbers that sum to 1.

Definition 4.25. An integer-valued random variable is a discrete random variable whose distinct values are x_k = k.

For integer-valued random variables,

P[X ∈ B] = ∑_{k ∈ B} P[X = k].

4.26. Properties of pmf

• p : Ω→ [0, 1].

• 0 ≤ pX ≤ 1.

• ∑k pX(xk) = 1.

Definition 4.27. Sometimes it is convenient to work with the "pdf" of a discrete r.v. Given that X is a discrete random variable defined as in Definition 4.23, the "pdf" of X is

f_X(x) = ∑_{x_k} p_X(x_k) δ(x − x_k), x ∈ R.   (13)

Although the delta function is not a well-defined function³, this technique does allow easy manipulation of mixed distributions. The definitions of quantities involving discrete random variables and the corresponding properties can then be derived from the pdf, and hence there is no need to talk about the pmf at all!

3Rigorously, it is a unit measure at 0.


4.4 Continuous random variable

Definition 4.28. A random variable X is said to be a continuous random variable if and only if any one of the following equivalent conditions holds.

≡ ∀x, P [X = x] = 0

≡ ∀ countable set C, PX (C) = 0

≡ FX is continuous

4.29. f is a (probability) density function (with respect to Lebesgue measure) of a random variable X (or of the distribution P^X)

≡ P^X has density f with respect to Lebesgue measure.

≡ P^X is absolutely continuous w.r.t. the Lebesgue measure (P^X ≪ λ) with f = dP^X/dλ, the Radon–Nikodym derivative.

≡ f is a nonnegative Borel function on R such that ∀B ∈ B_R, P^X(B) = ∫_B f(x)dx = ∫_B f dλ, where λ is the Lebesgue measure. (This extends nicely to the random vector case.)

≡ X is absolutely continuous

≡ X (or F_X) comes from the density f

≡ ∀x ∈ R, F_X(x) = ∫_{−∞}^x f(t)dt

≡ ∀a, b, F_X(b) − F_X(a) = ∫_a^b f(x)dx

4.30. If F does differentiate to f and f is continuous, it follows by the fundamental theorem of calculus that f is indeed a density for F. That is, if F has a continuous derivative, this derivative can serve as the density f.

4.31. Suppose a random variable X has a density f .

• F need not differentiate to f everywhere.

When X ∼ U(a, b), FX is not differentiable at a nor b.

• ∫_R f(x) dx = 1

• f is determined only Lebesgue-a.e. That is, if g = f Lebesgue-a.e., then g can also serve as a density for X and P^X.

• f is nonnegative a.e. [9, stated on p. 138]


• X is a continuous random variable

• f at its continuity points must be the derivative of F

• P[X ∈ [a, b]] = P[X ∈ [a, b)] = P[X ∈ (a, b]] = P[X ∈ (a, b)] because the corresponding integrals over an interval are not affected by whether or not the endpoints are included or excluded. In other words, P[X = a] = P[X = b] = 0.

• P[f_X(X) = 0] = 0

4.32. fX (x) = E [δ (X − x)]

Definition 4.33 (Absolutely Continuous CDF). An absolutely continuous cdf F_ac can be written in the form

F_ac(x) = ∫_{−∞}^x f(z)dz,

where the integrand, f(x) = (d/dx)F_ac(x), is defined a.e., and is a nonnegative, integrable function (possibly having discontinuities) satisfying ∫ f(x)dx = 1.

4.34. Any nonnegative function that integrates to one is a probability density function(pdf) [9, p. 139].

4.35. Remarks: Some useful intuitions

(a) Approximately, for a small ∆x, P[X ∈ [x, x + ∆x]] = ∫_x^{x+∆x} f_X(t)dt ≈ f_X(x)∆x. This is why we call f_X the density function.

(b) In fact, f_X(x) = lim_{∆x→0} P[x < X ≤ x + ∆x] / ∆x.

4.36. Let T be an absolutely continuous nonnegative random variable with cumulative distribution function F and density f on the interval [0, ∞). The following terms are often used when T denotes the lifetime of a device or system.

(a) Its survival-, survivor-, or reliability-function is

R(t) = P[T > t] = ∫_t^∞ f(x)dx = 1 − F(t).

• R(0) = P[T > 0] = P[T ≥ 0] = 1.

(b) The mean time to failure (MTTF) = E[T] = ∫_0^∞ R(t)dt.


(c) The (age-specific) failure rate or hazard function of a device or system with lifetime T is

r(t) = lim_{δ→0} P[T ≤ t + δ | T > t] / δ = −R′(t)/R(t) = f(t)/R(t) = −(d/dt) ln R(t).

(i) r(t)δ ≈ P[T ∈ (t, t + δ] | T > t]

(ii) R(t) = e^{−∫_0^t r(τ)dτ}.

(iii) f(t) = r(t) e^{−∫_0^t r(τ)dτ}

(A numerical check of (ii) and (iii) is sketched below.)

• For T ∼ E(λ), r(t) = λ.

See also [9, section 5.7].
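The relations (ii) and (iii) are easy to verify numerically. The sketch below uses an arbitrary illustrative increasing hazard rate r(t) = 2t, for which the closed form is R(t) = e^{−t²}; cumtrapz and trapz are base MATLAB functions.

r = @(t) 2*t;                        % illustrative increasing hazard rate
t = linspace(0, 4, 1e4);
R = exp(-cumtrapz(t, r(t)));         % (ii): R(t) = exp(-int_0^t r(tau) dtau)
f = r(t).*R;                         % (iii): f = r.*R
trapz(t, f)                          % total probability; should be close to 1
max(abs(R - exp(-t.^2)))             % compare with the closed form exp(-t^2)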

Definition 4.37. A random variable whose cdf is continuous but whose derivative is the zero function (almost everywhere) is said to be singular.

• See the Cantor-type distribution in [5, p. 35–36].

• It has no density. (Otherwise, the cdf would be the zero function.) So, there exists a continuous random variable X with no density. Hence, there exists a random variable X with no density.

• Even when we allow the use of the delta function for the density, as in the case of a mixed r.v., it still has no density because there is no jump in the cdf.

• There exists singular r.v. whose cdf is strictly increasing.

Definition 4.38. f_{X,A}(x) = (d/dx) F_{X,A}(x). See also Definition 4.17.

4.5 Mixed/hybrid Distribution

There are many occasions where we have a random variable whose distribution is a normalized linear combination of discrete and absolutely continuous distributions. For convenience, we use the Dirac delta function to link the pmf to a pdf as in Definition 4.27. Then, we only have to be concerned with the pdf of a mixed distribution/r.v.

4.39. By allowing density functions to contain impulses, the cdfs of mixed random variables can be expressed in the form F(x) = ∫_{(−∞, x]} f(t)dt.

4.40. Given a cdf F_X of a mixed random variable X, the density f_X is given by

f_X(x) = f̃_X(x) + ∑_k P[X = x_k] δ(x − x_k),

where

• the x_k are the distinct points at which F_X has jump discontinuities, and

• f̃_X(x) = F′_X(x) where F_X is differentiable at x, and f̃_X(x) = 0 otherwise.


In which case,

E[g(X)] = ∫ g(x) f̃_X(x)dx + ∑_k g(x_k) P[X = x_k].

Note also that P[X = x_k] = F(x_k) − F(x_k−).

4.41. Suppose the cdf F can be expressed in the form F(x) = G(x)U(x − x_0) for some function G. Then, the density is f(x) = G′(x)U(x − x_0) + G(x_0)δ(x − x_0). Note that G(x_0) = F(x_0) = P[X = x_0] is the jump of the cdf at x_0. When the random variable is continuous, G(x_0) = 0 and thus f(x) = G′(x)U(x − x_0).

4.6 Independence

Definition 4.42. A family of random variables {X_i : i ∈ I} is independent if for every finite J ⊂ I, the family of random variables {X_i : i ∈ J} is independent. In words, "an infinite collection of random elements is by definition independent if each finite subcollection is." Hence, we only need to know how to test independence for finite collections.

(a) The (E_i, E_i)'s are not required to be the same.

(b) The collection of random variables {1_{A_i} : i ∈ I} is independent iff the collection of events (sets) {A_i : i ∈ I} is independent.

Definition 4.43. Independence among a finite collection of random variables: For finite I, the following statements are equivalent:

≡ (X_i)_{i∈I} are independent (or mutually independent [2, p 182]).

≡ P[X_i ∈ H_i ∀i ∈ I] = ∏_{i∈I} P[X_i ∈ H_i], where H_i ∈ E_i

≡ P[(X_i : i ∈ I) ∈ ×_{i∈I} H_i] = ∏_{i∈I} P[X_i ∈ H_i], where H_i ∈ E_i

≡ P^{(X_i : i∈I)}(×_{i∈I} H_i) = ∏_{i∈I} P^{X_i}(H_i), where H_i ∈ E_i

≡ P[X_i ≤ x_i ∀i ∈ I] = ∏_{i∈I} P[X_i ≤ x_i]

≡ [Factorization criterion] F_{(X_i : i∈I)}((x_i : i ∈ I)) = ∏_{i∈I} F_{X_i}(x_i)

≡ X_i and X_1^{i−1} are independent ∀i ≥ 2

≡ σ(X_i) and σ(X_1^{i−1}) are independent ∀i ≥ 2

≡ For discrete random variables X_i with countable range E:

P[X_i = x_i ∀i ∈ I] = ∏_{i∈I} P[X_i = x_i] ∀x_i ∈ E, ∀i ∈ I

≡ For absolutely continuous X_i with densities f_{X_i}: f_{(X_i : i∈I)}((x_i : i ∈ I)) = ∏_{i∈I} f_{X_i}(x_i)


Definition 4.44. If the X_α, α ∈ I, are independent and each has the same marginal distribution Q, we say that the X_α's are iid (independent and identically distributed) and we write X_α ∼iid Q.

• The abbreviation can be IID [23, p 39].

Definition 4.45. A pairwise independent collection of random variables is a set of random variables any two of which are independent.

(a) Any collection of (mutually) independent random variables is pairwise independent

(b) Some pairwise independent collections are not independent. See example (4.46).

Example 4.46. Suppose X, Y, and Z have the following joint probability distribution: p_{X,Y,Z}(x, y, z) = 1/4 for (x, y, z) ∈ {(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)}. This, for example, can be constructed by starting with independent X and Y that are Bernoulli-1/2 and then setting Z = X ⊕ Y = X + Y mod 2.

(a) X, Y, Z are pairwise independent.

(b) The combination of X ⫫ Z and Y ⫫ Z does not imply (X, Y) ⫫ Z.
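A direct MATLAB enumeration of Example 4.46, confirming pairwise independence while the product rule fails for all three events together:

pts = [0 0 0; 0 1 1; 1 0 1; 1 1 0];         % support points of (X,Y,Z)
pr = 0.25*ones(4,1);                        % each support point has probability 1/4
P = @(mask) sum(pr(mask));                  % probability of the event described by mask
pX0 = P(pts(:,1)==0); pY0 = P(pts(:,2)==0); pZ0 = P(pts(:,3)==0);
P(pts(:,1)==0 & pts(:,2)==0) - pX0*pY0      % 0: X and Y independent
P(pts(:,1)==0 & pts(:,3)==0) - pX0*pZ0      % 0: X and Z independent
P(all(pts==0,2)) - pX0*pY0*pZ0              % 1/4 - 1/8: not mutually independent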

Definition 4.47. The convolution of probability measures µ_1 and µ_2 on (R, B) is the measure µ_1 ∗ µ_2 defined by

(µ_1 ∗ µ_2)(H) = ∫ µ_2(H − x) µ_1(dx), H ∈ B_R.

(a) µ_X ∗ µ_Y = µ_Y ∗ µ_X and µ_X ∗ (µ_Y ∗ µ_Z) = (µ_X ∗ µ_Y) ∗ µ_Z.

(b) If F_X and F_Y are the distribution functions corresponding to µ_X and µ_Y, the distribution function corresponding to µ_X ∗ µ_Y is

(µ_X ∗ µ_Y)(−∞, z] = ∫ F_Y(z − x) µ_X(dx).

In this case, it is notationally convenient to replace µ_X(dx) by dF_X(x) (Stieltjes integral). Then, (µ_X ∗ µ_Y)(−∞, z] = ∫ F_Y(z − x) dF_X(x). This is denoted by (F_X ∗ F_Y)(z). That is,

F_Z(z) = (F_X ∗ F_Y)(z) = ∫ F_Y(z − x) dF_X(x).

(c) If the density f_Y exists, F_X ∗ F_Y has density F_X ∗ f_Y, where

(F_X ∗ f_Y)(z) = ∫ f_Y(z − x) dF_X(x).

(i) If Y (or F_Y) is absolutely continuous with density f_Y, then for any X (or F_X), X + Y (or F_X ∗ F_Y) is absolutely continuous with density (F_X ∗ f_Y)(z) = ∫ f_Y(z − x) dF_X(x).


If, in addition, F_X has density f_X, then

∫ f_Y(z − x) dF_X(x) = ∫ f_Y(z − x) f_X(x) dx.

This is denoted by f_X ∗ f_Y. In other words, if densities f_X, f_Y exist, then F_X ∗ F_Y has density f_X ∗ f_Y, where

(f_X ∗ f_Y)(z) = ∫ f_Y(z − x) f_X(x) dx.

4.48. If random variables X and Y are independent and have distributions µ_X and µ_Y, then X + Y has distribution µ_X ∗ µ_Y.

4.49. Expectation and independence

(a) Let X and Y be nonnegative independent random variables on (Ω, A, P); then E[XY] = EX · EY.

(b) If X_1, X_2, . . . , X_n are independent and the g_k's are complex-valued measurable functions, then the g_k(X_k)'s are independent. Moreover, if each g_k(X_k) is integrable, then

E[∏_{k=1}^n g_k(X_k)] = ∏_{k=1}^n E[g_k(X_k)].

(c) If X_1, . . . , X_n are independent and ∑_{k=1}^n X_k has a finite second moment, then all the X_k have finite second moments as well. Moreover,

Var[∑_{k=1}^n X_k] = ∑_{k=1}^n Var X_k.

(d) If pairwise independent X_i ∈ L², then Var[∑_{i=1}^k X_i] = ∑_{i=1}^k Var[X_i].

4.7 Misc

4.50. The mode of a discrete probability distribution is the value at which its probability mass function takes its maximum value. The mode of a continuous probability distribution is the value at which its probability density function attains its maximum value.

• The mode is not necessarily unique, since the probability mass function or probability density function may achieve its maximum value at several points.

5 PMF Examples

Each of the following pmfs will be defined on its support S. For Ω larger than S, we simply set the pmf to be 0.


X ∼                    Support S               p_X(k)                          ϕ_X(u)
Uniform U_n            {1, 2, . . . , n}       1/n
U_{0,1,...,n−1}        {0, 1, . . . , n − 1}   1/n                             (1 − e^{iun}) / (n(1 − e^{iu}))
Bernoulli B(1, p)      {0, 1}                  1 − p for k = 0; p for k = 1
Binomial B(n, p)       {0, 1, . . . , n}       C(n, k) p^k (1 − p)^{n−k}       (1 − p + pe^{iu})^n
Geometric G_0(β)       N ∪ {0}                 (1 − β) β^k                     (1 − β) / (1 − βe^{iu})
Geometric G_1(β)       N                       (1 − β) β^{k−1}
Poisson P(λ)           N ∪ {0}                 e^{−λ} λ^k / k!                 e^{λ(e^{iu}−1)}

Table 2: Examples of probability mass functions. Here, p, β ∈ (0, 1) and λ > 0.

5.1 Random/Uniform

5.1. R_n, U_n

When an experiment results in a finite number of "equally likely" or "totally random" outcomes, we model it with a uniform random variable. We say that X is uniformly distributed on [n] if

P[X = k] = 1/n, k ∈ [n].

We write X ∼ U_n.

• p_i = 1/n for i ∈ S = {1, 2, . . . , n}.

• Examples

classical game of chance / classical probability; drawing at random

fair gaming devices (well-balanced coins and dice, well-shuffled decks of cards)

high-rate coded digital data

experiment where

∗ there are only n possible outcomes and they are all equally probable

∗ there is a balance of information about outcomes

5.2. Uniform on a finite set: U(S). Suppose |S| = n; then p(x) = 1/n for all x ∈ S.

Example 5.3. For X uniform on [−M:1:M], we have EX = 0 and Var X = M(M + 1)/3.

For X uniform on [N:1:M], we have EX = (N + M)/2 and Var X = (1/12)(M − N)(M − N + 2).

Example 5.4. Set S = {0, 1, 2, . . . , M}; then the sum of two independent U(S) random variables has pmf

p(k) = ((M + 1) − |k − M|) / (M + 1)²

for k = 0, 1, . . . , 2M. Note its triangular shape with maximum value p(M) = 1/(M + 1). To visualize the pmf in MATLAB, try


M = 5;                               % any illustrative value of M
k = 0:2*M;
P = (1/(M+1))*ones(1,M+1);           % pmf of U({0,1,...,M})
P = conv(P,P);                       % pmf of the sum of two independent copies
stem(k,P)

5.2 Bernoulli and Binary distributions

5.5. Bernoulli : B(1, p) or Bernoulli(p)

• S = {0, 1}

• p_0 = q = 1 − p, p_1 = p

• EX = E[X²] = p.
Var[X] = p − p² = p(1 − p). Note that the variance is maximized at p = 0.5.

(Figure 12: The plot of p(1 − p).)

5.6. Binary : Suppose X takes only two values a and b with b > a. P [X = b] = p.

(a) X can be expressed as X = (b − a)I + a, where I is a Bernoulli random variable with P[I = 1] = p.

(b) Var X = (b − a)² Var I = (b − a)² p(1 − p). Note that it is still maximized at p = 1/2.

(c) Suppose a = −b. Then, X = 2bI + a = 2bI − b. In this case, Var X = (2b)² p(1 − p) = 4b² p(1 − p).

(d) Suppose the X_k are independent random variables taking on two values b and a with P[X_k = b] = p = 1 − P[X_k = a], where b > a. Then, the sum S = ∑_{k=1}^n X_k is "binomial" on {k(b − a) + an : k = 0, 1, . . . , n}, where the probability at the point k(b − a) + an is C(n, k) p^k (1 − p)^{n−k}.


5.3 Binomial: B(n, p)

5.7. Binomial distribution with size n and parameter p. p ∈ [0, 1].

(a) p_i = C(n, i) p^i (1 − p)^{n−i} for i ∈ S = {0, 1, 2, . . . , n}

• Use binopdf(i,n,p) in MATLAB.

(b) X is the number of successes in n independent Bernoulli trials and hence the sum of n independent, identically distributed Bernoulli r.v.'s.

(c) ϕ_X(u) = (1 − p + pe^{iu})^n

(d) EX = np

(e) EX² = (np)² + np(1 − p)

(f) Var [X] = np (1− p)

(g) Tail probability: ∑_{r=k}^n C(n, r) p^r (1 − p)^{n−r} = I_p(k, n − k + 1)

(h) The maximum probability value occurs at k_max = mode X = ⌊(n + 1)p⌋ ≈ np.

• When (n+1)p is an integer, then the maximum is achieved at kmax and kmax−1.

(i) By (2),

P[X is even] = (1/2)(1 + (1 − 2p)^n), and
P[X is odd] = (1/2)(1 − (1 − 2p)^n).

(j) If we have E_1, . . . , E_n, n unlinked repetitions of E, and an event A for E, then the distribution B(n, p) describes the probability that A occurs k times in E_1, . . . , E_n.

(k) Gaussian approximation for binomial probabilities: When n is large, the binomial distribution becomes difficult to compute directly because of the need to calculate factorial terms. We can use

P[X = k] ≈ (1/√(2πnp(1 − p))) e^{−(k−np)²/(2np(1−p))},   (14)

which comes from approximating X by a Gaussian Y with the same mean and variance and the relation

P[X = k] ≈ P[X ≤ k] − P[X ≤ k − 1] ≈ P[Y ≤ k] − P[Y ≤ k − 1] ≈ f_Y(k).

See also (12.23). (A MATLAB comparison is sketched after this list.)

(l) Approximation: C(n, k) p^k (1 − p)^{n−k} = ((np)^k / k!) e^{−np} (1 + O(np², k²/n))
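The quality of the Gaussian approximation (14) in item (k) can be examined with a few lines of MATLAB (n and p below are arbitrary illustrative values; binopdf requires the Statistics Toolbox):

n = 100; p = 0.3; k = 0:n;
exact = binopdf(k, n, p);                            % exact binomial pmf
s2 = n*p*(1-p);                                      % variance np(1-p)
approx = exp(-(k - n*p).^2/(2*s2)) / sqrt(2*pi*s2);  % formula (14)
max(abs(exact - approx))                             % worst-case absolute error
stem(k, exact); hold on; plot(k, approx); hold off   % visual comparison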


• If g : Z_+ → R is any bounded function and Λ ∼ P(λ), then E[λ g(Λ + 1) − Λ g(Λ)] = 0.

Proof. E[λ g(Λ + 1)] = ∑_{i=0}^∞ λ g(i + 1) e^{−λ} λ^i / i! = ∑_{i=0}^∞ g(i + 1) e^{−λ} λ^{i+1} / i! = ∑_{m=1}^∞ m g(m) e^{−λ} λ^m / m! = E[Λ g(Λ)].

Any function f : Z_+ → R for which E[f(Λ)] = 0 can be expressed in the form f(j) = λ g(j + 1) − j g(j) for a bounded function g. Thus, conversely, if E[λ g(Λ + 1) − Λ g(Λ)] = 0 for all bounded g, then Λ has the Poisson distribution P(λ).

(Figure 13: Gaussian approximation to Binomial, Poisson distribution, and Gamma distribution.)

5.4 Geometric: G(β)

A geometric distribution is defined by the fact that for some β ∈ [0, 1), p_{k+1} = β p_k for all k ∈ S, where S can be either N or N ∪ {0}.

• When its support is N, p_k = (1 − β) β^{k−1}. This is referred to as G_1(β) or geometric_1(β). In MATLAB, use geopdf(k-1,1-β).

• When its support is N ∪ {0}, p_k = (1 − β) β^k. This is referred to as G_0(β) or geometric_0(β). In MATLAB, use geopdf(k,1-β).

5.8. Consider X ∼ G0 (β).

• p_i = (1 − β) β^i for i ∈ S = N ∪ {0}, 0 ≤ β < 1

• β = m/(m + 1), where m = average waiting time / lifetime

• P[X = k] = P[k failures followed by a success] = (P[failure])^k P[success].
P[X ≥ k] = β^k = the probability of having at least k initial failures = the probability of having to perform at least k + 1 trials.
P[X > k] = β^{k+1} = the probability of having at least k + 1 initial failures.

• Memoryless property:

P [X ≥ k + c |X ≥ k ] = P [X ≥ c], k, c > 0.

P [X > k + c |X ≥ k ] = P [X > c], k, c > 0.

If a success has not occurred in the first k trials (it has already failed k times), then the probability of having to perform at least j more trials is the same as the probability of initially having to perform at least j trials.

Each time a failure occurs, the system "forgets" and begins anew as if it were performing the first trial.


Geometric r.v. is the only discrete r.v. that satisfies the memoryless property.

• Ex.

lifetimes of components, measured in discrete time units, when they fail catastrophically (without degradation due to aging)

waiting times

∗ for next customer in a queue

∗ between radioactive disintegrations

∗ between photon emission

number of repeated, unlinked random experiments that must be performed prior to the first occurrence of a given event A

∗ number of coin tosses prior to the first appearance of a 'head'

∗ number of trials required to observe the first success

• The sum of independent G_0(p) and G_0(q) random variables has pmf

p(k) = (1 − p)(1 − q) (q^{k+1} − p^{k+1}) / (q − p),   p ≠ q,
p(k) = (k + 1)(1 − p)² p^k,                            p = q,

for k ∈ N ∪ {0}. (A numerical check via convolution is sketched below.)
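This pmf is easy to check numerically by convolving two truncated geometric pmfs, in the same spirit as the MATLAB snippet of Example 5.4 (the values of p, q, and the truncation point K below are arbitrary illustrative choices):

p = 0.6; q = 0.3; K = 60; k = 0:K;         % truncate the supports at K
p1 = (1-p)*p.^k; p2 = (1-q)*q.^k;          % pmfs of G0(p) and G0(q)
s = conv(p1, p2); s = s(1:K+1);            % pmf of the sum on k = 0..K
formula = (1-p)*(1-q)*(q.^(k+1)-p.^(k+1))/(q-p);
max(abs(s - formula))                      % should be numerically negligible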

5.9. Consider X ∼ G1 (β).

• P[X > k] = β^k

• Suppose X_i ∼ G_1(β_i) are independent. Then min(X_1, X_2, . . . , X_n) ∼ G_1(∏_{i=1}^n β_i).

5.5 Poisson Distribution: P(λ)

5.10. Characterized by

• p_X(k) = P[X = k] = e^{−λ} λ^k / k!; or equivalently,

• ϕ_X(u) = e^{λ(e^{iu}−1)},

where λ ∈ (0, ∞) is called the parameter or intensity parameter of the distribution. In MATLAB, use poisspdf(k,lambda).

5.11. Denoted by P (λ).

5.12. Instead of X, a Poisson random variable is often denoted by Λ.

5.13. EX = VarX = λ.

5.14. Successive probabilities are connected via the relation kpX(k) = λpX(k − 1).

5.15. mode X = ⌊λ⌋.


                 Most probable value (i_max)     Associated maximum probability
0 < λ < 1        0                               e^{−λ}
λ ∈ N            λ − 1 and λ                     (λ^λ / λ!) e^{−λ}
λ ≥ 1, λ ∉ N     ⌊λ⌋                             (λ^{⌊λ⌋} / ⌊λ⌋!) e^{−λ}

• Note that when λ ∈ N, there are two maxima, at λ − 1 and λ.

• When λ ≫ 1, p_X(⌊λ⌋) ≈ 1/√(2πλ) via Stirling's formula (1.13).

5.16. P[X ≥ 2] = 1 − e^{−λ} − λe^{−λ} = O(λ²).

The cumulative probabilities can be found by

P[X ≤ k] =(∗) P[∑_{i=1}^{k+1} X_i > 1] = (1/Γ(k + 1)) ∫_λ^∞ e^{−t} t^k dt,

P[X > k] = P[X ≥ k + 1] =(∗) P[∑_{i=1}^{k+1} X_i ≤ 1] = (1/Γ(k + 1)) ∫_0^λ e^{−t} t^k dt,

where the X_i's are i.i.d. E(λ). The equalities given by (∗) are easily obtained by counting the number of events of a rate-λ Poisson process on the interval [0, 1].
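The first identity can be checked directly in MATLAB via the regularized incomplete gamma function (λ and k below are arbitrary illustrative values; the optional poisscdf alternative requires the Statistics Toolbox):

lambda = 4.5; k = 7;
viaGamma = gammainc(lambda, k+1, 'upper')                  % (1/Gamma(k+1)) * integral from lambda to inf of e^(-t) t^k dt
viaSum = sum(exp(-lambda)*lambda.^(0:k)./factorial(0:k))   % direct P[X <= k]
% viaCdf = poisscdf(k, lambda)                             % Statistics Toolbox alternative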

5.17. Fano factor (index of dispersion): Var X / EX = 1.

An important property of the Poisson and compound Poisson laws is that their classes are closed under convolution (independent summation). In particular, we have the divisibility properties (5.22) and (5.32), which are straightforward to prove from their characteristic functions.

5.18 (Recursion equations). Suppose X ∼ P(λ). Let m_k(λ) = E[X^k] and µ_k(λ) = E[(X − EX)^k]. Then

m_{k+1}(λ) = λ (m_k(λ) + m′_k(λ)),   (15)

µ_{k+1}(λ) = λ (k µ_{k−1}(λ) + µ′_k(λ))   (16)

[15, p 112]. Starting with m_1 = λ = µ_2 and µ_1 = 0, the above equations lead to a recursive determination of the moments m_k and µ_k.

5.19. E[1/(X + 1)] = (1/λ)(1 − e^{−λ}). Because for d ∈ N, Y = (1/(X + 1)) ∑_{n=0}^d a_n X^n can be expressed as (∑_{n=0}^{d−1} b_n X^n) + c/(X + 1), the value of EY is easy to find if we know the EX^n.

5.20. Mixed Poisson distribution: Let X be Poisson with mean λ. Suppose that the mean λ is chosen in accord with a probability distribution whose characteristic function is ϕ_Λ. Then,

ϕ_X(u) = E[E[e^{iuX} | Λ]] = E[e^{Λ(e^{iu}−1)}] = E[e^{i(−i(e^{iu}−1))Λ}] = ϕ_Λ(−i(e^{iu} − 1)).


• EX = EΛ.

• VarX = Var Λ + EΛ.

• E[X²] = E[Λ²] + EΛ.

• Var[X|Λ] = E[X|Λ] = Λ.

• When Λ is a nonnegative integer-valued random variable, we have G_X(z) = G_Λ(e^{z−1}) and P[X = 0] = G_Λ(1/e).

• E[XΛ] = E[Λ²]

• Cov[X, Λ] = Var Λ

5.21. Thinned Poisson: Suppose we have X → [s] → Y, where X ∼ P(λ). The box [s] is a binomial channel with success probability s. (Each count in X gets through the channel with success probability s.)

• Note that Y is in fact a random sum ∑_{i=1}^X I_i, where the I_i are i.i.d. Bernoulli with parameter s.

• Y ∼ P(sλ);

• p(x | y) = e^{−λ(1−s)} (λ(1 − s))^{x−y} / (x − y)!, x ≥ y (shifted Poisson);

[Levy and Baxter, 2002]

5.22. Finite additivity: Suppose we have independent Λ_i ∼ P(λ_i); then ∑_{i=1}^n Λ_i ∼ P(∑_{i=1}^n λ_i).

5.23. Raikov’s theorem: independent random variables can have their sum Poisson-distributed only if every component of the sum is Poisson-distributed.

5.24. Countable Additivity Theorem [12, p 5]: Let (X_j : j ∈ N) be independent random variables, and assume that X_j has the distribution P(µ_j) for each j. If

∑_{j=1}^∞ µ_j   (17)

converges to µ, then S = ∑_{j=1}^∞ X_j converges with probability 1, and S has distribution P(µ). If, on the other hand, (17) diverges, then S diverges with probability 1.

5.25. Let X_1, X_2, . . . , X_n be independent, and let X_j have distribution P(µ_j) for all j. Then S_n = ∑_{j=1}^n X_j has distribution P(µ), with µ = ∑_{j=1}^n µ_j; and so, whenever ∑_{j=1}^n r_j = s,

P[X_j = r_j ∀j | S_n = s] = (s! / (r_1! r_2! · · · r_n!)) ∏_{j=1}^n (µ_j / µ)^{r_j},

which follows the multinomial distribution [12, p 6–7].


• If X and Y are independent Poisson random variables with respective parameters λ and µ, then (1) Z = X + Y is P(λ + µ) and (2) conditioned on Z = z, X is B(z, λ/(λ + µ)). So, E[X|Z] = (λ/(λ + µ))Z, Var[X|Z] = Z λµ/(λ + µ)², and E[Var[X|Z]] = λµ/(λ + µ).

5.26. One of the reasons why the Poisson distribution is important is that many natural phenomena can be modeled by Poisson processes. For example, if we consider the number of occurrences Λ during a time interval of length τ in a rate-λ homogeneous Poisson process, then Λ ∼ P(λτ).

Example 5.27.

• The first use of the Poisson model is said to have been by a Prussian physician, von Bortkiewicz, who found that the annual number of late-19th-century Prussian soldiers kicked to death by horses followed a Poisson distribution [7, p 150].

• #photons emitted by a light source of intensity λ [photons/second] in time τ

• #atoms of radioactive material undergoing decay in time τ

• #clicks in a Geiger counter in τ seconds when the average number of click in 1 secondis λ.

• #dopant atoms deposited to make a small device such as an FET

• #customers arriving in a queue or workstations requesting service from a file server in time τ

• Counts of demands for telephone connections

• number of occurrences of rare events in time τ

• #soldiers kicked to death by horses

• Counts of defects in a semiconductor chip.

5.28. Normal approximation to the Poisson distribution with large λ: Let X ∼ P(λ). X can be thought of as a sum of i.i.d. X_i ∼ P(λ_n), i.e., X = ∑_{i=1}^n X_i, where nλ_n = λ. Hence X is approximately normal N(λ, λ) for λ large.

Some say that the normal approximation is good when λ > 5.

5.29. The Poisson distribution can be obtained as a limit of negative binomial distributions. Thus, the negative binomial distribution with parameters r and p can be approximated by the Poisson distribution with parameter λ = rq/p (mean-matching), provided that p is "sufficiently" close to 1 and r is "sufficiently" large.


5.30. Convergence of sums of Bernoulli random variables to the Poisson law: Suppose that for each n ∈ N,

X_{n,1}, X_{n,2}, . . . , X_{n,r_n}

are independent; the probability space for the sequence may change with n. Such a collection is called a triangular array [1] or double sequence [8], which captures the nature of the collection when it is arranged as

X_{1,1}, X_{1,2}, . . . , X_{1,r_1},
X_{2,1}, X_{2,2}, . . . , X_{2,r_2},
. . .
X_{n,1}, X_{n,2}, . . . , X_{n,r_n},
. . .

where the random variables in each row are independent. Let S_n = X_{n,1} + X_{n,2} + · · · + X_{n,r_n} be the sum of the random variables in the nth row.

Consider a triangular array of Bernoulli random variables X_{n,k} with P[X_{n,k} = 1] = p_{n,k}. If max_{1≤k≤r_n} p_{n,k} → 0 and ∑_{k=1}^{r_n} p_{n,k} → λ as n → ∞, then the sums S_n converge in distribution to the Poisson law. In other words, the Poisson distribution is the rare-events limit of the binomial (large n, small p).

As a simple special case, consider a triangular array of Bernoulli random variables X_{n,k} with P[X_{n,k} = 1] = p_n. If np_n → λ as n → ∞, then the sums S_n converge in distribution to the Poisson law.

To show this special case directly, we bound the first i terms of n! to get (n − i)^i / i! ≤ C(n, i) ≤ n^i / i!. Using the upper bound,

C(n, i) p_n^i (1 − p_n)^{n−i} ≤ (1/i!) (np_n)^i (1 − p_n)^{−i} (1 − np_n/n)^n,

where (np_n)^i → λ^i, (1 − p_n)^{−i} → 1, and (1 − np_n/n)^n → e^{−λ}. The lower bound gives the same limit because (n − i)^i = ((n − i)/n)^i n^i, where the first factor → 1.
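A numerical illustration of this Poisson limit of the binomial in MATLAB (n and p are chosen so that np = 3; binopdf and poisspdf require the Statistics Toolbox):

lambda = 3; k = 0:15;
for n = [10 100 1000]
    p = lambda/n;
    err = max(abs(binopdf(k, n, p) - poisspdf(k, lambda)));
    fprintf('n = %4d: max pmf difference = %.2e\n', n, err);
end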

5.6 Compound Poisson

Given an arbitrary probability measure µ and a positive real number λ, the compound Poisson distribution CP(λ, µ) is the distribution of the sum ∑_{j=1}^Λ V_j, where the V_j are i.i.d. with distribution µ and Λ is a P(λ) random variable, independent of the V_j.

Sometimes it is written as POIS(λµ). The parameter λ is called the rate of CP(λ, µ) and µ is called the base distribution.

5.31. The mean and variance of CP(λ, L(V)) are λE[V] and λE[V²], respectively.
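A simulation sketch of 5.31 in MATLAB. The base distribution of V below (uniform on {1, 2, 3}) and the rate are arbitrary illustrative choices; poissrnd requires the Statistics Toolbox.

lambda = 2.5; N = 1e5;
vals = [1 2 3];                          % illustrative support of V (equally likely)
Z = zeros(1, N);
for m = 1:N
    L = poissrnd(lambda);                % Lambda ~ P(lambda)
    V = vals(randi(numel(vals), 1, L));  % L i.i.d. copies of V
    Z(m) = sum(V);                       % one compound Poisson sample
end
EV = mean(vals); EV2 = mean(vals.^2);
[mean(Z) lambda*EV]                      % sample mean vs. lambda*E[V]
[var(Z) lambda*EV2]                      % sample variance vs. lambda*E[V^2]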

5.32. If Z ∼ CP(λ, q), then ϕ_Z(t) = e^{λ(ϕ_q(t)−1)}.



5.33. Divisibility property of the compound Poisson law: Suppose we have independent Λ_i ∼ CP(λ_i, µ^{(i)}); then ∑_{i=1}^n Λ_i ∼ CP(λ, (1/λ) ∑_{i=1}^n λ_i µ^{(i)}), where λ = ∑_{i=1}^n λ_i.

Proof.

ϕ_{∑_{i=1}^n Z_i}(t) = ∏_{i=1}^n e^{λ_i(ϕ_{q_i}(t)−1)} = exp(∑_{i=1}^n λ_i(ϕ_{q_i}(t) − 1)) = exp(∑_{i=1}^n λ_i ϕ_{q_i}(t) − λ) = exp(λ((1/λ) ∑_{i=1}^n λ_i ϕ_{q_i}(t) − 1)).

We usually focus on the case where µ is a discrete probability measure on N = {1, 2, . . .}. In this case, we usually refer to µ by the pmf q on N; q is called the base pmf. Equivalently, CP(λ, q) is also the distribution of the sum ∑_{i∈N} i Λ_i, where (Λ_i : i ∈ N) are independent with Λ_i ∼ P(λ q_i). Note that ∑_{i∈N} λ q_i = λ. The Poisson distribution is a special case of the compound Poisson distribution where we set q to be the point mass at 1.

5.34. The compound negative binomial [Bower, Gerber, Hickman, Jones, and Nesbitt,1982, Ch 11] can be approximated by the compound Poisson distribution.

5.7 Hypergeometric

An urn contains N white balls and M black balls. One draws n balls without replacement, so n ≤ N + M. One gets X white balls and n − X black balls.
$$P[X = x] = \begin{cases} \dfrac{\binom{N}{x}\binom{M}{n-x}}{\binom{N+M}{n}}, & 0 \le x \le N \text{ and } 0 \le n-x \le M,\\[2mm] 0, & \text{otherwise.} \end{cases}$$

5.35. The hypergeometric distributions "converge" to the binomial distribution: Assume that n is fixed while N and M increase to +∞ with $\lim_{N,M\to\infty} \frac{N}{N+M} = p$. Then
$$p(x) \to \binom{n}{x} p^x (1-p)^{n-x} \quad \text{(binomial)}.$$
Note that the binomial corresponds to drawing balls with replacement:
$$p(x) = \binom{n}{x}\frac{N^x M^{n-x}}{(N+M)^n} = \binom{n}{x}\left(\frac{N}{N+M}\right)^{x}\left(\frac{M}{N+M}\right)^{n-x}.$$
Intuitively, when N and M are large, there is not much difference between drawing n balls with or without replacement.
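A quick numerical check of this convergence, assuming the Statistics Toolbox functions hygepdf and binopdf and illustrative urn sizes, is sketched below.

    n = 10;                              % fixed sample size
    N = 5000; M = 5000; p = N/(N+M);     % large urn, so N/(N+M) = 0.5
    x = 0:n;
    pHyper = hygepdf(x, N+M, N, n);      % hypergeometric pmf (population N+M, N white balls)
    pBinom = binopdf(x, n, p);           % binomial pmf with the limiting p
    disp(max(abs(pHyper - pBinom)));     % discrepancy shrinks as N and M grow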


5.36. Extension: Suppose we have m colors and N_i balls of color i, so the urn contains N = N₁ + · · · + N_m balls. One draws n balls without replacement. Call X_i the number of balls of color i among the n drawn. (Of course, X₁ + · · · + X_m = n.)
$$P[X_1 = x_1, \ldots, X_m = x_m] = \begin{cases} \dfrac{\binom{N_1}{x_1}\binom{N_2}{x_2}\cdots\binom{N_m}{x_m}}{\binom{N}{n}}, & x_1+\cdots+x_m = n \text{ and } x_i \ge 0,\\[2mm] 0, & \text{otherwise.} \end{cases}$$

5.8 Negative Binomial Distribution (Pascal / Pólya distribution)

5.37. The probability that the rth success occurs on the (x + r)th trial is
$$p\binom{x+r-1}{r-1} p^{r-1}(1-p)^x = \binom{x+r-1}{r-1} p^r (1-p)^x = \binom{x+r-1}{x} p^r (1-p)^x,$$
i.e. among the first x + r − 1 trials there are r − 1 successes and x failures.

• Fix r.

• $\varphi_X(u) = p^r \dfrac{1}{(1-(1-p)e^{iu})^r}$.

• $EX = \dfrac{rq}{p}$ and $\operatorname{Var}[X] = \dfrac{rq}{p^2}$, where q = 1 − p.

• Note that if we define $\binom{n}{x} \equiv \dfrac{n(n-1)\cdots(n-(x-1))}{x(x-1)\cdots 1}$, then
$$\binom{-r}{x} \equiv (-1)^x\,\frac{r(r+1)\cdots(r+(x-1))}{x(x-1)\cdots 1} = (-1)^x\binom{r+x-1}{x}.$$

• If the X_i ∼ NegBin(r_i, p) are independent, then $\sum_i X_i \sim \mathrm{NegBin}\!\left(\sum_i r_i, p\right)$. This is easy to see from the characteristic function.

• When r = 1, we have the geometric distribution. Hence, when r is a positive integer, the negative binomial is a sum of r i.i.d. geometric random variables.

• $p(x) = \dfrac{\Gamma(r+x)}{\Gamma(r)\,x!}\, p^r (1-p)^x$.

5.38. A negative binomial distribution can arise as a mixture of Poisson distributions whose mean is distributed according to a gamma distribution $\Gamma\!\left(q = r,\ \lambda = \frac{p}{1-p}\right)$.

Let X be Poisson with mean λ. Suppose that the mean λ is chosen according to a probability distribution $F_\Lambda(\lambda)$. Then $\varphi_X(u) = \varphi_\Lambda\!\left(-i(e^{iu}-1)\right)$ [see the compound Poisson distribution]. Here $\Lambda \sim \Gamma(q, \lambda_0)$; hence $\varphi_\Lambda(u) = \frac{1}{(1-i\frac{u}{\lambda_0})^q}$. So
$$\varphi_X(u) = \left(1 - \frac{e^{iu}-1}{\lambda_0}\right)^{-q} = \left(\frac{\frac{\lambda_0}{\lambda_0+1}}{1-\frac{1}{\lambda_0+1}e^{iu}}\right)^{q},$$
which is negative binomial with $p = \frac{\lambda_0}{\lambda_0+1}$.

Hence it is also known as the Poisson–gamma distribution, or simply a compound Poisson distribution.
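The mixture representation can be checked by simulation. The sketch below (assuming the Statistics Toolbox functions gamrnd, poissrnd, and nbinpdf, with illustrative parameter values) draws a gamma-distributed mean, then a Poisson count, and compares the empirical pmf with the negative binomial pmf.

    r = 3; lambda0 = 1.5; p = lambda0/(lambda0+1);    % gamma shape r and rate lambda0
    N = 1e5;
    LambdaSmp = gamrnd(r, 1/lambda0, N, 1);           % gamrnd takes the scale 1/lambda0
    X = poissrnd(LambdaSmp);                          % Poisson with gamma-distributed mean
    k = 0:10;
    empPmf = arrayfun(@(kk) mean(X==kk), k);          % empirical pmf on 0..10
    nbPmf  = nbinpdf(k, r, p);                        % negative binomial pmf
    disp([empPmf; nbPmf]);                            % the two rows should be close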


5.9 Beta-binomial distribution

A random variable with a beta-binomial distribution is distributed as a binomial with parameter p, where p itself is distributed according to a beta distribution with parameters α and β (below, q₁ and q₂).

• $P(k \mid p) = \binom{n}{k} p^k (1-p)^{n-k}$.

• $f(p) = f^{\beta}_{q_1,q_2}(p) = \frac{\Gamma(q_1+q_2)}{\Gamma(q_1)\Gamma(q_2)}\, p^{q_1-1}(1-p)^{q_2-1}\, 1_{(0,1)}(p)$.

• pmf: $P(k) = \binom{n}{k}\frac{\Gamma(q_1+q_2)}{\Gamma(q_1)\Gamma(q_2)}\,\frac{\Gamma(k+q_1)\Gamma(n-k+q_2)}{\Gamma(q_1+q_2+n)} = \binom{n}{k}\frac{B(k+q_1,\,n-k+q_2)}{B(q_1,q_2)}$.

• $EX = \frac{n q_1}{q_1+q_2}$.

• $\operatorname{Var}X = \frac{n q_1 q_2 (n+q_1+q_2)}{(q_1+q_2)^2(1+q_1+q_2)}$.

5.10 Zipf or zeta random variable

5.39. $P[X = k] = \frac{1}{\xi(p)}\,\frac{1}{k^p}$, where k ∈ N, p > 1, and ξ is the zeta function defined in (A.13).

5.40. $E[X^n] = \frac{\xi(p-n)}{\xi(p)}$ is finite for n < p − 1, and $E[X^n] = \infty$ for n ≥ p − 1.

6 PDF Examples

6.1 Uniform Distribution

6.1. Characterization for uniform[a, b]:

(a) $f(x) = \frac{1}{b-a}U(x-a)U(b-x) = \begin{cases} 0, & x < a \text{ or } x > b,\\ \frac{1}{b-a}, & a \le x \le b. \end{cases}$

(b) $F(x) = \begin{cases} 0, & x < a,\\ \frac{x-a}{b-a}, & a \le x \le b,\\ 1, & x > b. \end{cases}$

(c) $\varphi_X(u) = e^{iu\frac{b+a}{2}}\,\dfrac{\sin\!\left(u\frac{b-a}{2}\right)}{u\frac{b-a}{2}}$.

(d) $M_X(s) = \dfrac{e^{sb}-e^{sa}}{s(b-a)}$.

6.2. For most purposes it does not matter whether the value of the density f at the endpoints is 0 or $\frac{1}{b-a}$.

Example 6.3.

• Phase of oscillators: Θ ∼ U[−π, π] or U[0, 2π].

• Phase of received signals in incoherent communications: the usual broadcast carrier phase φ ∼ U(−π, π).

• Mobile cellular communication: multipath path phases φ_c ∼ U(−π, π).

• Use with caution to represent ignorance about a parameter taking value in [a, b].


X ∼ | f_X(x) | ϕ_X(u)

Uniform U(a, b) | $\frac{1}{b-a}1_{[a,b]}(x)$ | $e^{iu\frac{b+a}{2}}\frac{\sin(u\frac{b-a}{2})}{u\frac{b-a}{2}}$
Exponential E(λ) | $\lambda e^{-\lambda x}1_{[0,\infty)}(x)$ | $\frac{\lambda}{\lambda-iu}$
Shifted Exponential (µ, s₀) | $\frac{1}{\mu-s_0}e^{-\frac{x-s_0}{\mu-s_0}}1_{[s_0,\infty)}(x)$ |
Truncated Exponential | $\frac{\alpha}{e^{-\alpha a}-e^{-\alpha b}}e^{-\alpha x}1_{[a,b]}(x)$ |
Laplacian L(α) | $\frac{\alpha}{2}e^{-\alpha|x|}$ | $\frac{\alpha^2}{\alpha^2+u^2}$
Normal N(m, σ²) | $\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2}$ | $e^{ium-\frac{1}{2}\sigma^2u^2}$
Normal N(m, Λ) | $\frac{1}{(2\pi)^{n/2}\sqrt{\det\Lambda}}e^{-\frac{1}{2}(x-m)^T\Lambda^{-1}(x-m)}$ | $e^{ju^Tm-\frac{1}{2}u^T\Lambda u}$
Gamma Γ(q, λ) | $\frac{\lambda^q x^{q-1}e^{-\lambda x}}{\Gamma(q)}1_{(0,\infty)}(x)$ | $\frac{1}{(1-i\frac{u}{\lambda})^q}$
Pareto Par(α) | $\alpha x^{-(\alpha+1)}1_{[1,\infty)}(x)$ |
Par(α, c) = cPar(α) | $\frac{\alpha}{c}\left(\frac{c}{x}\right)^{\alpha+1}1_{(c,\infty)}(x)$ |
Beta β(q₁, q₂) | $\frac{\Gamma(q_1+q_2)}{\Gamma(q_1)\Gamma(q_2)}x^{q_1-1}(1-x)^{q_2-1}1_{(0,1)}(x)$ |
Beta prime | $\frac{\Gamma(q_1+q_2)}{\Gamma(q_1)\Gamma(q_2)}\frac{x^{q_1-1}}{(x+1)^{q_1+q_2}}1_{(0,\infty)}(x)$ |
Rayleigh | $2\alpha x e^{-\alpha x^2}1_{[0,\infty)}(x)$ |
Standard Cauchy | $\frac{1}{\pi}\frac{1}{1+x^2}$ |
Cau(α) | $\frac{1}{\pi}\frac{\alpha}{\alpha^2+x^2}$ |
Cau(α, d) | $\frac{\Gamma(d)}{\sqrt{\pi}\,\alpha\,\Gamma(d-\frac{1}{2})}\frac{1}{\left(1+(x/\alpha)^2\right)^d}$ |
Log Normal $e^{N(\mu,\sigma^2)}$ | $\frac{1}{\sigma x\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{\ln x-\mu}{\sigma}\right)^2}1_{(0,\infty)}(x)$ |

Table 3: Examples of probability density functions. Here c, α, q, q₁, q₂, σ, λ are all strictly positive and d > ½. γ = −ψ(1) ≈ 0.5772 is the Euler constant. ψ(z) = $\frac{d}{dz}\log\Gamma(z) = (\log e)\frac{\Gamma'(z)}{\Gamma(z)}$ is the digamma function. B(q₁, q₂) = $\frac{\Gamma(q_1)\Gamma(q_2)}{\Gamma(q_1+q_2)}$ is the beta function.


6.4. $EX = \frac{a+b}{2}$, $\operatorname{Var}X = \frac{(b-a)^2}{12}$, $E[X^2] = \frac{1}{3}(b^2+ab+a^2)$.

6.5. The product X of two independent U[0, 1] random variables has
$$f_X(x) = -\ln(x)\,1_{[0,1]}(x) \quad\text{and}\quad F_X(x) = x - x\ln x \text{ on } [0,1].$$
This comes from $P[X > x] = \int_0^1 \left(1 - F_U\!\left(\tfrac{x}{t}\right)\right)dt = \int_x^1 \left(1-\tfrac{x}{t}\right)dt$.
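A short simulation, sketched below in MATLAB (histcounts requires a reasonably recent release), compares a histogram estimate of the density of the product of two independent U[0, 1] variables with −ln x.

    N = 1e6;
    X = rand(N,1) .* rand(N,1);              % product of two independent U[0,1] samples
    edges = 0:0.02:1; centers = edges(1:end-1) + 0.01;
    counts = histcounts(X, edges);           % bin counts over [0,1]
    empPdf = counts / (N * 0.02);            % normalize counts to a density estimate
    plot(centers, empPdf, 'o', centers, -log(centers), '-');
    legend('histogram estimate', '-ln(x)');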

6.2 Gaussian Distribution

6.6. Gaussian distribution:

(a) Denoted by N(m, σ²). N(0, 1) is the standard Gaussian (normal) distribution.

(b) $f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2}$.

(c) $F_X(x)$ = normcdf(x,m,sigma) in MATLAB.

• The standard normal cdf is sometimes denoted by Φ(x). It inherits all properties of a cdf. Moreover, note that Φ(−x) = 1 − Φ(x).

(d) $\varphi_X(v) = E[e^{jvX}] = e^{jmv-\frac{1}{2}v^2\sigma^2}$.

(e) $M_X(s) = e^{sm+\frac{1}{2}s^2\sigma^2}$.

(f) Fourier transform: $\mathcal{F}f_X = \int_{-\infty}^{\infty} f_X(x)e^{-j\omega x}\,dx = e^{-j\omega m-\frac{1}{2}\omega^2\sigma^2}$.

(g) $P[X > x] = P[X \ge x] = Q\!\left(\frac{x-m}{\sigma}\right) = 1-\Phi\!\left(\frac{x-m}{\sigma}\right) = \Phi\!\left(-\frac{x-m}{\sigma}\right)$;
$P[X < x] = P[X \le x] = 1-Q\!\left(\frac{x-m}{\sigma}\right) = Q\!\left(-\frac{x-m}{\sigma}\right) = \Phi\!\left(\frac{x-m}{\sigma}\right)$.

[Figure 14: Probability density function of X ∼ N(m, σ²), illustrating that about 68% of the probability mass lies within µ ± σ and about 95% within µ ± 2σ.]

6.7. Properties


(a) P[|X − µ| < σ] = 0.6827; P[|X − µ| > σ] = 0.3173; P[|X − µ| > 2σ] = 0.0455; P[|X − µ| < 2σ] = 0.9545.

(b) Moments and central moments:

(i) $E[(X-\mu)^k] = (k-1)\sigma^2\,E[(X-\mu)^{k-2}] = \begin{cases} 0, & k \text{ odd},\\ 1\cdot3\cdot5\cdots(k-1)\,\sigma^k, & k \text{ even}. \end{cases}$

(ii) $E[|X-\mu|^k] = \begin{cases} 2\cdot4\cdot6\cdots(k-1)\,\sigma^k\sqrt{\frac{2}{\pi}}, & k \text{ odd},\\ 1\cdot3\cdot5\cdots(k-1)\,\sigma^k, & k \text{ even}. \end{cases}$

(iii) Var[X²] = 4µ²σ² + 2σ⁴.

n | 0 | 1 | 2 | 3 | 4
EXⁿ | 1 | µ | µ² + σ² | µ(µ² + 3σ²) | µ⁴ + 6µ²σ² + 3σ⁴
E[(X − µ)ⁿ] | 1 | 0 | σ² | 0 | 3σ⁴

(c) For N(0, 1) and k ≥ 1,
$$E[X^k] = (k-1)E[X^{k-2}] = \begin{cases} 0, & k \text{ odd},\\ 1\cdot3\cdot5\cdots(k-1), & k \text{ even}. \end{cases}$$
The first equality comes from integration by parts. Observe also that $E[X^{2m}] = \frac{(2m)!}{2^m m!}$.

(d) Lévy–Cramér theorem: If the sum of two independent non-constant random variables is normally distributed, then each of the summands is normally distributed.

• Note that $\int_{-\infty}^{\infty} e^{-\alpha x^2}\,dx = \sqrt{\frac{\pi}{\alpha}}$.

6.8 (Length bound). For X ∼ N(0, 1) and any (Borel) set B,
$$P[X \in B] \le \int_{-|B|/2}^{|B|/2} f_X(x)\,dx = 1 - 2Q\!\left(\frac{|B|}{2}\right),$$
where |B| is the length (Lebesgue measure) of the set B. This is because the probability is concentrated around 0. More generally, for X ∼ N(m, σ²),
$$P[X \in B] \le 1 - 2Q\!\left(\frac{|B|}{2\sigma}\right).$$

6.9 (Stein's Lemma). Let X ∼ N(µ, σ²), and let g be a differentiable function satisfying E|g′(X)| < ∞. Then
$$E[g(X)(X-\mu)] = \sigma^2\,E[g'(X)]$$
[2, Lemma 3.6.5 p 124]. Note that this is simply integration by parts with u = g(x) and dv = (x − µ)f_X(x)dx.


• $E[(X-\mu)^k] = E[(X-\mu)^{k-1}(X-\mu)] = \sigma^2(k-1)\,E[(X-\mu)^{k-2}]$.

6.10. Q-function: $Q(z) = \int_z^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\,dx$ corresponds to P[X > z] where X ∼ N(0, 1); that is, Q(z) is the probability of the "tail" of N(0, 1). The Q-function is thus a complementary cdf (ccdf).

[Figure 15: The Q-function Q(z) as the tail probability of N(0, 1), together with erf(z) shown as the probability that an N(0, ½) random variable lies in (−z, z).]

(a) Q is a decreasing function with Q(0) = ½.

(b) Q(−z) = 1 − Q(z) = Φ(z).

(c) Q⁻¹(1 − Q(z)) = −z.

(d) Craig's formula: $Q(x) = \frac{1}{\pi}\int_0^{\pi/2} e^{-\frac{x^2}{2\sin^2\theta}}\,d\theta = \frac{1}{\pi}\int_0^{\pi/2} e^{-\frac{x^2}{2\cos^2\theta}}\,d\theta$, for x ≥ 0.

To see this, consider X, Y i.i.d. ∼ N(0, 1). Then
$$Q(z) = \iint_{(x,y)\in(z,\infty)\times\mathbb{R}} f_{X,Y}(x,y)\,dx\,dy = 2\int_0^{\pi/2}\!\!\int_{\frac{z}{\cos\theta}}^{\infty} f_{X,Y}(r\cos\theta, r\sin\theta)\,r\,dr\,d\theta,$$
where we evaluate the double integral using polar coordinates [9, Q7.22 p 322].

(e) $Q^2(x) = \frac{1}{\pi}\int_0^{\pi/4} e^{-\frac{x^2}{2\sin^2\theta}}\,d\theta$.

(f) $\frac{d}{dx}Q(x) = -\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$.

(g) $\frac{d}{dx}Q(f(x)) = -\frac{1}{\sqrt{2\pi}}e^{-\frac{(f(x))^2}{2}}\frac{d}{dx}f(x)$.

(h) $\int Q(f(x))\,g(x)\,dx = Q(f(x))\int g(x)\,dx + \int \frac{1}{\sqrt{2\pi}}e^{-\frac{(f(x))^2}{2}}\left(\frac{d}{dx}f(x)\right)\left(\int_a^x g(t)\,dt\right)dx$.


(i) $P[X > x] = Q\!\left(\frac{x-m}{\sigma}\right)$; $P[X < x] = 1 - Q\!\left(\frac{x-m}{\sigma}\right) = Q\!\left(-\frac{x-m}{\sigma}\right)$.

(j) Approximations:

(i) $Q(z) \approx \left[\dfrac{1}{(1-a)z + a\sqrt{z^2+b}}\right]\dfrac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}$, with $a = \frac{1}{\pi}$, $b = 2\pi$.

(ii) $\left(1-\dfrac{1}{x^2}\right)\dfrac{e^{-\frac{x^2}{2}}}{x\sqrt{2\pi}} \le Q(x) \le \dfrac{1}{2}e^{-\frac{x^2}{2}}$.

(iii) $Q(z) \approx \dfrac{1}{z\sqrt{2\pi}}\left(1-\dfrac{0.7}{z^2}\right)e^{-\frac{z^2}{2}}$, for z > 2.
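The quality of these approximations is easy to examine numerically. The sketch below uses only base MATLAB (writing Q(z) via erfc, as in 6.11(e)) and compares approximation (iii) against the exact value for a few arguments.

    z = 2.5:0.5:5;
    Qexact  = 0.5*erfc(z/sqrt(2));                                 % Q(z) = (1/2) erfc(z/sqrt(2))
    Qapprox = (1./(z*sqrt(2*pi))).*(1 - 0.7./z.^2).*exp(-z.^2/2);  % approximation (iii), z > 2
    disp([z; Qexact; Qapprox; Qapprox./Qexact]);                   % ratio approaches 1 as z grows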

6.11. Error function (MATLAB): $\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}}\int_0^z e^{-x^2}\,dx = 1 - 2Q(\sqrt{2}\,z)$.

(a) It is an odd function of z.

(b) For z ≥ 0, it corresponds to P[|X| < z] where X ∼ N(0, ½).

(c) $\lim_{z\to\infty}\operatorname{erf}(z) = 1$.

(d) erf(−z) = −erf(z).

(e) $Q(z) = \frac{1}{2}\operatorname{erfc}\!\left(\frac{z}{\sqrt{2}}\right) = \frac{1}{2}\left(1-\operatorname{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right)$.

(f) $\Phi(x) = \frac{1}{2}\left(1+\operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right) = \frac{1}{2}\operatorname{erfc}\!\left(-\frac{x}{\sqrt{2}}\right)$.

(g) $Q^{-1}(q) = \sqrt{2}\,\operatorname{erfc}^{-1}(2q)$.

(h) The complementary error function: $\operatorname{erfc}(z) = 1-\operatorname{erf}(z) = 2Q(\sqrt{2}\,z) = \frac{2}{\sqrt{\pi}}\int_z^{\infty} e^{-x^2}\,dx$.

[Figure 16: erf-function and Q-function: erf(z) shown as the probability that an N(0, ½) random variable lies in (−z, z), with tail probability 2Q(√2 z).]


6.3 Exponential Distribution

6.12. Denoted by E(λ). It is in fact Γ(1, λ). Here λ > 0 is a parameter of the distribution, often called the rate parameter.

6.13. Characterized by

• $f_X(x) = \lambda e^{-\lambda x}U(x)$;

• $F_X(x) = \left(1-e^{-\lambda x}\right)U(x)$;

• survival (survivor, reliability) function: $P[X > x] = e^{-\lambda x}1_{[0,\infty)}(x) + 1_{(-\infty,0)}(x)$;

• $\varphi_X(u) = \frac{\lambda}{\lambda-iu}$;

• $M_X(s) = \frac{\lambda}{\lambda-s}$ for Re s < λ.

6.14. $EX = \sigma_X = \frac{1}{\lambda}$, $\operatorname{Var}[X] = \frac{1}{\lambda^2}$.

6.15. median(X) = $\frac{1}{\lambda}\ln 2$, mode(X) = 0, $E[X^n] = \frac{n!}{\lambda^n}$.

6.16. Coefficient of variation: $CV = \frac{\sigma_X}{EX} = 1$.

6.17. It is a continuous version of the geometric distribution. In fact, ⌊X⌋ ∼ G₀(e^{−λ}) and ⌈X⌉ ∼ G₁(e^{−λ}).

6.18. X ∼ E(λ) is simply $\frac{1}{\lambda}X_1$ where X₁ ∼ E(1).

6.19. Suppose X₁ ∼ E(1). Then E[X₁ⁿ] = n! for n ∈ N ∪ {0}. In general, for X ∼ E(λ), we have $E[X^\alpha] = \frac{1}{\lambda^\alpha}\Gamma(\alpha+1)$ for any α > −1. In particular, for n ∈ N ∪ {0}, the moment $E[X^n] = \frac{n!}{\lambda^n}$.

6.20. $\mu_3 = E[(X-EX)^3] = \frac{2}{\lambda^3}$ and $\mu_4 = \frac{9}{\lambda^4}$.

6.21. Hazard function: $\frac{f(x)}{P[X>x]} = \lambda$.

6.22. $h(X) = \log\frac{e}{\lambda}$.

6.23. Can be generated by $X = -\frac{1}{\lambda}\ln U$ where U ∼ U(0, 1); see the sketch after 6.24.

6.24. MATLAB:

• X = exprnd(1/lambda)

• fX(x) = exppdf(x,1/lambda)

• FX(x) = expcdf(x,1/lambda)
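The inverse-transform recipe of 6.23 can be checked against the built-in generator directly. A minimal sketch (exprnd requires the Statistics Toolbox and takes the mean 1/λ, while the −(1/λ)ln U construction needs only base MATLAB):

    lambda = 2; N = 1e5;
    X1 = -(1/lambda)*log(rand(N,1));              % inverse-transform method of 6.23
    X2 = exprnd(1/lambda, N, 1);                  % built-in generator (parameterized by the mean)
    disp([mean(X1) var(X1); mean(X2) var(X2)]);   % both rows approx [1/lambda, 1/lambda^2]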

6.25. Memoryless property: The exponential r.v. is the only continuous r.v. on [0, ∞) that satisfies the memoryless property
$$P[X > s + x \mid X > s] = P[X > x]$$
for all x > 0 and all s > 0 [14, p. 157–159]. In words, the future is independent of the past: the fact that the event hasn't happened yet tells us nothing about how much longer it will take before it does happen.


• In particular, define the set B + x to be {x + b : b ∈ B}. For any x > 0 and set B ⊂ [0, ∞), we have
$$P[X \in B + x \mid X > x] = P[X \in B]$$
because
$$\frac{P[X \in B+x]}{P[X>x]} = \frac{\int_{B+x}\lambda e^{-\lambda t}\,dt}{e^{-\lambda x}} \overset{\tau=t-x}{=} \frac{\int_B \lambda e^{-\lambda(\tau+x)}\,d\tau}{e^{-\lambda x}} = \int_B \lambda e^{-\lambda\tau}\,d\tau.$$

6.26. The difference of two independent E(λ) random variables is L(λ). In particular, suppose X and Y are i.i.d. E(λ); then $E|X-Y| = \sigma_X = \frac{1}{\lambda}$ and $\sqrt{E[(X-Y)^2]} = \frac{\sqrt{2}}{\lambda}$.

6.27. Consider independent X_i ∼ E(λ). Let $S_n = \sum_{i=1}^{n} X_i$.

(a) S_n ∼ Γ(n, λ), i.e. it has the n-Erlang distribution.

(b) Let N = inf{n : S_n ≥ s}. Then N ∼ Λ + 1 where Λ ∼ P(λs).

6.28. If the X_i ∼ E(λ_i) are independent, then

(a) $\min_i X_i \sim E\!\left(\sum_i \lambda_i\right)$. Recall order statistics; let $Y_1 = \min_i X_i$.

(b) $P\!\left[\min_i X_i = X_j\right] = \dfrac{\lambda_j}{\sum_i \lambda_i}$. Note that this is $\int_0^{\infty} f_{X_j}(t)\prod_{i\ne j} P[X_i > t]\,dt$.

6.29. If $S_i, T_i \overset{\text{i.i.d.}}{\sim} E(\alpha)$, then
$$P\!\left(\sum_{i=1}^{m} S_i > \sum_{j=1}^{n} T_j\right) = \sum_{i=0}^{m-1}\binom{n+m-1}{i}\left(\frac{1}{2}\right)^{n+m-1} = \sum_{i=0}^{m-1}\binom{n+i-1}{i}\left(\frac{1}{2}\right)^{n+i}.$$
Note that we can set up two Poisson processes and consider the superposed process: we want the nth arrival of the T process to come before the mth arrival of the S process.

6.4 Pareto: Par(α) — a heavy-tailed model/density

6.30. Characterizations: Fix α > 0.

(a) $f(x) = \alpha x^{-\alpha-1}U(x-1)$.

(b) $F(x) = \left(1-\frac{1}{x^\alpha}\right)U(x-1) = \begin{cases} 0, & x < 1,\\ 1-\frac{1}{x^\alpha}, & x \ge 1. \end{cases}$

Example 6.31.


• distribution of wealth

• flood heights of the Nile river

• designing dam height

• (discrete) sizes of files requested by web users

• waiting times between successive keystrokes at computer terminals

• (discrete) sizes of files stored on Unix system file servers

• running times for NP-hard problems as a function of certain parameters

6.5 Laplacian: L(α)

6.32. Characterization: α > 0.

(a) Also known as the Laplace or double-exponential distribution.

(b) $f(x) = \frac{\alpha}{2}e^{-\alpha|x|}$.

(c) $F(x) = \begin{cases} \frac{1}{2}e^{\alpha x}, & x < 0,\\ 1-\frac{1}{2}e^{-\alpha x}, & x \ge 0. \end{cases}$

(d) $\varphi_X(u) = \frac{\alpha^2}{\alpha^2+u^2}$.

(e) $M_X(s) = \frac{\alpha^2}{\alpha^2-s^2}$, for −α < Re s < α.

6.33. EX = 0, $\operatorname{Var}X = \frac{2}{\alpha^2}$, $E|X| = \frac{1}{\alpha}$.

Example 6.34.

• amplitudes of speech signals

• amplitudes of differences of intensities between adjacent pixels in an image

• If X and Y are independent E(λ), then X − Y is L(λ). (Easy proof via ch.f.)

6.6 Rayleigh

6.35. Characterizations:

(a) $F(x) = \left(1-e^{-\alpha x^2}\right)U(x)$.

(b) $f(x) = 2\alpha x e^{-\alpha x^2}U(x)$.

(c) $P[X > t] = 1-F(t) = \begin{cases} e^{-\alpha t^2}, & t \ge 0,\\ 1, & t < 0. \end{cases}$

(d) Use $\sqrt{-2\sigma^2\ln U}$ to generate Rayleigh$\left(\frac{1}{2\sigma^2}\right)$ from U ∼ U(0, 1).
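Both generation routes — the inverse transform of 6.35(d) and the √(X² + Y²) construction of 6.38 — are easy to compare numerically. The sketch below uses base MATLAB only, with σ an illustrative value.

    sigma = 1.5; N = 1e5;
    R1 = sqrt(-2*sigma^2*log(rand(N,1)));          % inverse-transform route of 6.35(d)
    X = sigma*randn(N,1); Y = sigma*randn(N,1);
    R2 = sqrt(X.^2 + Y.^2);                        % route of 6.38(b)
    % Both should have mean sigma*sqrt(pi/2) and second moment 2*sigma^2
    disp([mean(R1) mean(R2) sigma*sqrt(pi/2); mean(R1.^2) mean(R2.^2) 2*sigma^2]);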


6.36. Read “ray′-lee”

Example 6.37.

• Noise X at the output of AM envelope detector when no signal is present

6.38. Relationship with other distributions

(a) Let X be a Rayleigh(α) r.v.; then Y = X² is E(α). Hence
$$E(\alpha) \xrightarrow{\ \sqrt{\cdot}\ } \text{Rayleigh}(\alpha), \qquad \text{Rayleigh}(\alpha) \xrightarrow{\ (\cdot)^2\ } E(\alpha). \tag{18}$$

(b) Suppose X, Y i.i.d. ∼ N(0, σ²). Then $R = \sqrt{X^2+Y^2}$ has a Rayleigh distribution with density
$$f_R(r) = 2r\,\frac{1}{2\sigma^2}\,e^{-\frac{1}{2\sigma^2}r^2}. \tag{19}$$

• Note that X², Y² i.i.d. ∼ Γ(½, 1/(2σ²)). Hence X² + Y² ∼ Γ(1, α = 1/(2σ²)), i.e. exponential. By (18), $\sqrt{X^2+Y^2}$ is a Rayleigh r.v. with α = 1/(2σ²).

• Alternatively, the transformation from Cartesian coordinates (x, y) to polar coordinates (r, θ) gives
$$f_{R,\Theta}(r,\theta) = r f_{X,Y}(r\cos\theta, r\sin\theta) = r\,\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{r\cos\theta}{\sigma}\right)^2}\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{r\sin\theta}{\sigma}\right)^2} = \left(\frac{1}{2\pi}\right)\left(2r\,\frac{1}{2\sigma^2}e^{-\frac{1}{2\sigma^2}r^2}\right).$$
Hence the radius R and the angle Θ are independent, with R having a Rayleigh distribution and Θ uniformly distributed on (0, 2π).

6.7 Cauchy

6.39. Characterizations: Fix α > 0.

(a) $f_X(x) = \frac{\alpha}{\pi}\,\frac{1}{\alpha^2+x^2}$.

(b) $F_X(x) = \frac{1}{\pi}\tan^{-1}\!\left(\frac{x}{\alpha}\right) + \frac{1}{2}$.

(c) $\varphi_X(u) = e^{-\alpha|u|}$.

Note that Cau(α) is simply αX₁ where X₁ ∼ Cau(1). Also, because f_X is even, f_{−X} = f_X and thus −X is still ∼ Cau(α).

6.40. Odd moments are not defined; even moments are infinite.

• Because the first moment is not defined, central moments, including the variance, are not defined.


• Mean and variance do not exist.

• Note that even though the pdf of the Cauchy distribution is an even function, we cannot conclude that the mean is 0; the mean simply does not exist.

6.41. Suppose the X_i are independent Cau(α_i). Then $\sum_i a_iX_i \sim \text{Cau}\!\left(\sum_i |a_i|\alpha_i\right)$.

6.42. Suppose X ∼ Cau(α). Then $\frac{1}{X} \sim \text{Cau}\!\left(\frac{1}{\alpha}\right)$.

6.8 More PDFs

6.43. Beta

(a) $f^{\beta}_{q_1,q_2}(x) = \frac{\Gamma(q_1+q_2)}{\Gamma(q_1)\Gamma(q_2)}x^{q_1-1}(1-x)^{q_2-1}1_{(0,1)}(x)$, x ∈ (0, 1).

(b) MATLAB: X = betarnd(q1,q2), fX(x) = betapdf(x,q1,q2), and FX(x) = betacdf(x,q1,q2).

(c) For symmetric distributions, q₁ and q₂ are equal. As their common value increases, the distribution becomes more peaked.

(d) The uniform distribution has q₁ = q₂ = 1.

(e) $E\!\left[X^i(1-X)^j\right] = \frac{\Gamma(q_1+q_2)}{\Gamma(q_1+q_2+i+j)}\,\frac{\Gamma(q_1+i)}{\Gamma(q_1)}\,\frac{\Gamma(q_2+j)}{\Gamma(q_2)}$.

(f) $E[X] = \frac{q_1}{q_1+q_2}$ and $\operatorname{Var}X = \frac{q_1q_2}{(q_1+q_2)^2(q_1+q_2+1)}$.

(g) Parameters for a random variable with mean m and variance σ²:

(i) Suppose we want E[X] = m and Var X = σ²; then we need $q_1 = m\left(\frac{(1-m)m}{\sigma^2}-1\right)$ and $q_2 = (1-m)\left(\frac{(1-m)m}{\sigma^2}-1\right)$.

(ii) The support of Y = aX is [0, a]. Suppose we want E[Y] = m and Var Y = σ²; then we set $q_1 = \frac{m}{a}\left(\frac{(a-m)m}{\sigma^2}-1\right)$ and $q_2 = \left(1-\frac{m}{a}\right)\left(\frac{(a-m)m}{\sigma^2}-1\right)$.

6.44. Rice/Rician

(a) Characterizations: fix v ≥ 0 and σ > 0,
$$f_R(r) = \frac{r}{\sigma^2}\exp\!\left(-\frac{r^2+v^2}{2\sigma^2}\right)I_0\!\left(\frac{rv}{\sigma^2}\right)1[r > 0],$$
where I₀ is the modified Bessel function⁴ of the first kind with order zero.

(b) When v = 0, the distribution reduces to the Rayleigh distribution given in (19).

⁴ $I_0(z) = \frac{1}{\pi}\int_0^{\pi} e^{z\cos\theta}\,d\theta$. In MATLAB, use besseli(0,z) for I₀(z).


(c) Suppose we have independent N_i ∼ N(m_i, σ²). Then $R = \sqrt{N_1^2+N_2^2}$ is Rician with $v = \sqrt{m_1^2+m_2^2}$ and the same σ. To see this, use (30) and the fact that we can express m₁ = v cos φ and m₂ = v sin φ for some φ.

(d) $E[R^2] = 2\sigma^2 + v^2$, $E[R^4] = 8\sigma^4 + 8\sigma^2v^2 + v^4$.

6.45. Weibull: For λ > 0 and p > 0, the Weibull(p, λ) distribution [9] is characterized by

(a) $X = \left(\frac{Y}{\lambda}\right)^{\frac{1}{p}}$ where Y ∼ E(1);

(b) $f_X(x) = \lambda p x^{p-1}e^{-\lambda x^p}$, x > 0;

(c) $F_X(x) = 1 - e^{-\lambda x^p}$, x > 0;

(d) $E[X^n] = \dfrac{\Gamma\!\left(1+\frac{n}{p}\right)}{\lambda^{n/p}}$.

7 Expectation

Consider a probability space (Ω, A, P).

7.1. Let X⁺ = max(X, 0) and X⁻ = −min(X, 0) = max(−X, 0). Then X = X⁺ − X⁻, and X⁺, X⁻ are nonnegative r.v.'s. Also, |X| = X⁺ + X⁻.

7.2. A random variable X is integrable if and only if

≡ X has a finite expectation

≡ both E[X⁺] and E[X⁻] are finite

≡ E|X| is finite

≡ EX is finite ≡ EX is defined

≡ X ∈ L¹

≡ |X| ∈ L¹.

In which case,
$$EX = E[X^+] - E[X^-] = \int X(\omega)\,P(d\omega) = \int X\,dP = \int x\,dP^X(x) = \int x\,P^X(dx)$$
and
$$\int_A X\,dP = E[1_AX].$$


Definition 7.3. A r.v. X admits (has) an expectation if E[X⁺] and E[X⁻] are not both equal to +∞. Then the expectation of X is still given by EX = E[X⁺] − E[X⁻], with the conventions +∞ + a = +∞ and −∞ + a = −∞ when a ∈ R.

7.4. L¹ = L¹(Ω, A, P) = the set of all integrable random variables.

7.5. For 1 ≤ p < ∞, the following are equivalent:

(a) $(E[X^p])^{\frac{1}{p}} = 0$;

(b) $E[X^p] = 0$;

(c) X = 0 a.s.

7.6. X = Y a.s. ⇒ EX = EY.

7.7. $E[1_B(X)] = P(X^{-1}(B)) = P^X(B) = P[X \in B]$.

• $F_X(x) = E[1_{(-\infty,x]}(X)]$.

7.8. Expectation rule: Let X be a r.v. on (Ω, A, P), with values in (E, E), and distribution P^X. Let h : (E, E) → (R, B) be measurable. If

• X ≥ 0, or

• h(X) ∈ L¹(Ω, A, P), which is equivalent to h ∈ L¹(E, E, P^X),

then

• $E[h(X)] = \int h(X(\omega))\,P(d\omega) = \int h(x)\,P^X(dx)$;

• $\int_{[X\in G]} h(X(\omega))\,P(d\omega) = \int_G h(x)\,P^X(dx)$.

7.9. Expectation of an absolutely continuous random variable: Suppose X has density f_X. Then h is P^X-integrable if and only if h · f_X is integrable w.r.t. Lebesgue measure. In which case,
$$E[h(X)] = \int h(x)\,P^X(dx) = \int h(x)f_X(x)\,dx \quad\text{and}\quad \int_G h(x)\,P^X(dx) = \int_G h(x)f_X(x)\,dx.$$

• Caution: Suppose h is an odd function and f_X is an even function; we cannot conclude that E[h(X)] = 0. One obvious odd function h is h(x) = x. For example, in (6.40), when X is Cauchy, the expectation does not exist even though the pdf is an even function. Of course, if we also know that h(X) is integrable, then E[h(X)] is 0.


Expectation of a discrete random variable: Suppose X is a discrete random variable. Then
$$EX = \sum_x x\,P[X=x] \quad\text{and}\quad E[g(X)] = \sum_x g(x)\,P[X=x].$$
Similarly,
$$E[g(X,Y)] = \sum_x\sum_y g(x,y)\,P[X=x, Y=y].$$
These are called the law/rule of the lazy statistician (LOTUS) [23, Thm 3.6 p 48], [9, p. 149].

7.10. $\int_E P[X \ge t]\,dt = \int_E P[X > t]\,dt$ and $\int_E P[X \le t]\,dt = \int_E P[X < t]\,dt$.

7.11. Expectation and cdf:

(a) For nonnegative X,
$$EX = \int_0^{\infty} P[X > y]\,dy = \int_0^{\infty}(1-F_X(y))\,dy = \int_0^{\infty} P[X \ge y]\,dy. \tag{20}$$
For p > 0,
$$E[X^p] = \int_0^{\infty} p\,x^{p-1}P[X > x]\,dx.$$

(b) For integrable X,
$$EX = \int_0^{\infty}(1-F_X(x))\,dx - \int_{-\infty}^{0} F_X(x)\,dx.$$

(c) For nonnegative integer-valued X, $EX = \sum_{k=0}^{\infty} P[X > k] = \sum_{k=1}^{\infty} P[X \ge k]$.
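As a quick numerical illustration of (20), the base-MATLAB sketch below integrates the survival function of an E(λ) random variable and compares the result with the known mean 1/λ (λ is an illustrative value).

    lambda = 2;
    survival = @(y) exp(-lambda*y);            % P[X > y] for X ~ E(lambda)
    EX = integral(survival, 0, Inf);           % numerical integral of the survival function
    disp([EX, 1/lambda]);                      % both approximately 0.5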

Definition 7.12.

(a) Absolute moment: $E[|X|^k] = \int |x|^k\,P^X(dx)$, where we define $E[|X|^0] = 1$.

(b) Moment: If $E[|X|^k] < \infty$, then $m_k = E[X^k] = \int x^k\,P^X(dx)$ is the kth moment of X.

(c) Variance: If $E[X^2] < \infty$, then we define
$$\operatorname{Var}X = E[(X-EX)^2] = \int (x-EX)^2\,P^X(dx) = E[X^2] - (EX)^2 = E[X(X-EX)].$$


• Notation: DX, or σ²(X), or σ²_X, or VX [23, p. 51].

• If $X_i \in L^2$, then $\sum_{i=1}^{k} X_i \in L^2$, $\operatorname{Var}\left[\sum_{i=1}^{k} X_i\right]$ exists, and $E\left[\sum_{i=1}^{k} X_i\right] = \sum_{i=1}^{k} EX_i$.

• Suppose E[X²] < ∞. If Var X = 0, then X = EX a.s.

(d) Standard deviation: $\sigma_X = \sqrt{\operatorname{Var}[X]}$.

(e) Coefficient of variation: $CV_X = \frac{\sigma_X}{EX}$.

(i) It is the standard deviation of the "normalized" random variable $\frac{X}{EX}$.

(ii) It equals 1 for the exponential distribution.

(f) Fano factor (index of dispersion): $\frac{\operatorname{Var}X}{EX}$.

• It equals 1 for the Poisson distribution.

(g) Central moments: the nth central moment is $\mu_n = E[(X-EX)^n]$.

(i) $\mu_1 = E[X-EX] = 0$.

(ii) $\mu_2 = \sigma_X^2 = \operatorname{Var}X$.

(iii) $\mu_n = \sum_{k=0}^{n}\binom{n}{k}m_{n-k}(-m_1)^k$.

(iv) $m_n = \sum_{k=0}^{n}\binom{n}{k}\mu_{n-k}\,m_1^k$.

(h) Skewness coefficient: $\gamma_X = \frac{\mu_3}{\sigma_X^3}$.

(i) It describes the deviation of the distribution from a symmetric shape (around the mean).

(ii) It is 0 for any symmetric distribution.

(i) Kurtosis: $\kappa_X = \frac{\mu_4}{\sigma_X^4}$.

• κ_X = 3 for the Gaussian distribution.

(j) Excess coefficient: $\varepsilon_X = \kappa_X - 3 = \frac{\mu_4}{\sigma_X^4} - 3$.

(k) Cumulants or semi-invariants: for one variable, $\gamma_k = \frac{1}{j^k}\left.\frac{\partial^k}{\partial v^k}\ln(\varphi_X(v))\right|_{v=0}$.

(i) γ₁ = EX = m₁;
γ₂ = E[X − EX]² = m₂ − m₁² = µ₂;
γ₃ = E[X − EX]³ = m₃ − 3m₁m₂ + 2m₁³ = µ₃;
γ₄ = m₄ − 3m₂² − 4m₁m₃ + 12m₁²m₂ − 6m₁⁴ = µ₄ − 3µ₂².

(ii) m₁ = γ₁;
m₂ = γ₂ + γ₁²;
m₃ = γ₃ + 3γ₁γ₂ + γ₁³;
m₄ = γ₄ + 3γ₂² + 4γ₁γ₃ + 6γ₁²γ₂ + γ₁⁴.


[Figure 17: Product-moments of random variables [22]: a table summarizing, for the first four moments (mean; variance, standard deviation, and coefficient of variation; skewness and skewness coefficient; kurtosis and excess coefficient), the defining integral or sum for continuous and discrete variables together with the usual sample estimators.]

Model | E[X] | Var[X]

U{0, 1, . . . , n−1} | (n−1)/2 | (n²−1)/12
B(n, p) | np | np(1−p)
G(β) | β/(1−β) | β/(1−β)²
P(λ) | λ | λ
U(a, b) | (a+b)/2 | (b−a)²/12
E(λ) | 1/λ | 1/λ²
Par(α) | α/(α−1) for α > 1; ∞ for 0 < α ≤ 1 | undefined for 0 < α < 1; ∞ for 1 < α < 2; α/((α−2)(α−1)²) for α > 2
L(α) | 0 | 2/α²
N(m, σ²) | m | σ²
N(m, Λ) | m | Λ = [Cov[X_i, X_j]]
Γ(q, λ) | q/λ | q/λ²

Table 4: Expectations and Variances

7.13.

• For c ∈ R, E[c] = c.

• E[·] is a linear operator: E[aX + bY] = aEX + bEY.

• In general, Var[·] is not a linear operator.

7.14. All pairs of mean and variance are possible. A random variable X with EX = m and Var X = σ² can be constructed by setting P[X = m − σ] = P[X = m + σ] = ½.

Definition 7.15.

• Correlation between X and Y : E [XY ].


• Covariance between X and Y:
$$\operatorname{Cov}[X,Y] = E[(X-EX)(Y-EY)] = E[XY]-EX\,EY = E[X(Y-EY)] = E[Y(X-EX)].$$

• X and Y are said to be uncorrelated if and only if Cov[X, Y] = 0, equivalently E[XY] = EX EY.

• X and Y are said to be orthogonal if E[XY] = 0.

• Correlation coefficient, autocorrelation, normalized covariance:
$$\rho_{XY} = \frac{\operatorname{Cov}[X,Y]}{\sigma_X\sigma_Y} = E\!\left[\left(\frac{X-EX}{\sigma_X}\right)\left(\frac{Y-EY}{\sigma_Y}\right)\right] = \frac{E[XY]-EX\,EY}{\sigma_X\sigma_Y}.$$

7.16. Properties

(a) Var X = Cov[X, X], ρ_{X,X} = 1.

(b) Var[aX] = a² Var X, σ_{aX} = |a|σ_X.

(c) If X and Y are independent, then Cov[X, Y] = 0. The converse is not true.

(d) ρ_{XY} ∈ [−1, 1].

(e) By the Cauchy–Schwarz inequality, (Cov[X, Y])² ≤ σ_X²σ_Y², with equality if and only if σ_Y²(X − EX)² = σ_X²(Y − EY)² a.s.

• This implies |ρ_{X,Y}| ≤ 1.

When σ_Y, σ_X > 0, equality occurs if and only if the following equivalent conditions hold:

≡ ∃a ≠ 0 such that (X − EX) = a(Y − EY);

≡ ∃c ≠ 0 and b ∈ R such that Y = cX + b;

≡ ∃a ≠ 0 and b ∈ R such that X = aY + b;

≡ |ρ_{XY}| = 1.

In this case, |a| = σ_X/σ_Y and ρ_{XY} = a/|a| = sgn a. Hence ρ_{XY} is used to quantify the linear dependence between X and Y: the closer |ρ_{XY}| is to 1, the higher the degree of linear dependence between X and Y.

(f) Linearity:

(i) Let Y_i = a_iX_i + b_i.

i. Cov[Y₁, Y₂] = Cov[a₁X₁ + b₁, a₂X₂ + b₂] = a₁a₂ Cov[X₁, X₂].

ii. The correlation coefficient is preserved under linear transformation: ρ_{Y₁,Y₂} = ρ_{X₁,X₂} (when a₁a₂ > 0; the sign is flipped if a₁a₂ < 0).


(ii) Cov[a₁X + b₁, a₂X + b₂] = a₁a₂ Var X.

(iii) ρ_{a₁X+b₁, a₂X+b₂} = 1 when a₁a₂ > 0. In particular, if Y = aX + b with a > 0, then ρ_{X,Y} = 1.

(g) ρ_{X,Y} = 0 if and only if X and Y are uncorrelated.

(h) When EX = 0 or EY = 0, orthogonality is equivalent to uncorrelatedness.

(i) For a finite index set I,
$$\operatorname{Var}\left[\sum_{i\in I} a_iX_i\right] = \sum_{i\in I} a_i^2\operatorname{Var}X_i + 2\sum_{i<j} a_i a_j\operatorname{Cov}[X_i,X_j].$$
In particular,
Var(X + Y) = Var X + Var Y + 2 Cov[X, Y] and Var(X − Y) = Var X + Var Y − 2 Cov[X, Y].

(j) For finite index sets I and J,
$$\operatorname{Cov}\left[\sum_{i\in I} a_iX_i,\ \sum_{j\in J} b_jY_j\right] = \sum_{i\in I}\sum_{j\in J} a_i b_j\operatorname{Cov}[X_i,Y_j].$$

(k) Covariance inequality: Let X be any random variable and g and h any functions such that E[g(X)], E[h(X)], and E[g(X)h(X)] exist.

• If g and h are either both non-decreasing or both non-increasing, then
Cov[g(X), h(X)] ≥ 0.  (21)

• If g is non-decreasing and h is non-increasing, then
Cov[g(X), h(X)] ≤ 0.  (22)

See also [2, p. 191–192] and (8.13).

(l) Being uncorrelated does not imply independence.

• Discrete: Suppose p_X is an even function with p_X(0) = 0. Let Y = g(X) where g is also an even function. Then E[XY] = E[X] = 0, so E[XY] = E[X]E[Y] and Cov[X, Y] = 0. Consider a point x₀ such that p_X(x₀) > 0. Then p_{X,Y}(x₀, g(x₀)) = p_X(x₀). We only need to show that p_Y(g(x₀)) ≠ 1 to show that X and Y are not independent. For example, let X be uniform on {±1, ±2} and Y = |X|; consider the point x₀ = 1.

• Continuous: Let Θ be uniform on an interval of length 2π. Set X = cos Θ and Y = sin Θ. See (11.6).


7.17. Suppose X and Y are i.i.d. random variables. Then $\sqrt{E[(X-Y)^2]} = \sqrt{2}\,\sigma_X$.

7.18. See (4.49) for relationships between expectation and independence.

Example 7.19 (Martingale betting strategy). Fix a > 0. Suppose X₀, X₁, X₂, . . . are independent random variables with P[X_i = 1] = p and P[X_i = 0] = 1 − p. Let N = inf{i : X_i = 1}. Also define
$$L(N) = \begin{cases} 0, & N = 0,\\ a\sum_{i=0}^{N-1} r^i, & N \in \mathbb{N}, \end{cases}$$
and G(N) = ar^N − L(N). To have G(N) > 0, we need $\sum_{i=0}^{k-1} r^i < r^k$ for all k ∈ N, which turns out to require r ≥ 2. In fact, for r ≥ 2 we have G(N) ≥ a for all N ∈ N ∪ {0}. Hence E[G(N)] ≥ a; it is exactly a when r = 2.

Now, $E[L(N)] = a\sum_{n=1}^{\infty}\frac{r^n-1}{r-1}\,p(1-p)^n = \infty$ if and only if r(1 − p) ≥ 1. When 1 − p ≥ ½ (i.e. p ≤ ½), because we already have r ≥ 2, it is true that r(1 − p) ≥ 1.

8 Inequalities

8.1. Let (A_i : i ∈ I) be a finite family of events. Then
$$\frac{\left(\sum_i P(A_i)\right)^2}{\sum_i\sum_j P(A_i\cap A_j)} \le P\!\left(\bigcup_i A_i\right) \le \sum_i P(A_i).$$

8.2. [20, p. 14]
$$-\left(1-\frac{1}{n}\right)^n \le P\!\left(\bigcap_{i=1}^{n} A_i\right) - \prod_{i=1}^{n} P(A_i) \le (n-1)\,n^{-\frac{n}{n-1}},$$
where the lower bound tends to −1/e ≈ −0.37 and the upper bound tends to 1 as n grows. See Figure 18.

• $|P(A_1\cap A_2) - P(A_1)P(A_2)| \le \frac{1}{4}$.

8.3. Markov's inequality: $P[|X| \ge a] \le \frac{1}{a}E|X|$, for a > 0.

(a) It is useless when a ≤ E|X|; hence it is good for bounding the "tails" of a distribution.

(b) Remark: P[|X| > a] ≤ P[|X| ≥ a].

(c) $P[|X| \ge aE|X|] \le \frac{1}{a}$, for a > 0.


[Figure 18: Bound for P(⋂ᵢ₌₁ⁿ Aᵢ) − ∏ᵢ₌₁ⁿ P(Aᵢ), plotting the lower bound −(1 − 1/x)ˣ, which tends to −1/e, and the upper bound (x − 1)x^{−x/(x−1)}.]

[Figure 19: Proof of Markov's inequality: a·1[x ≥ a] ≤ x for x ≥ 0, so aP[|X| ≥ a] ≤ E|X|.]


(d) Suppose g is a nonnegative function. Then, for all α > 0 and p > 0, we have

(i) $P[g(X) \ge \alpha] \le \frac{1}{\alpha^p}E[(g(X))^p]$;

(ii) $P[g(X-EX) \ge \alpha] \le \frac{1}{\alpha^p}E[(g(X-EX))^p]$.

(e) Chebyshev's inequality: $P[|X| > a] \le P[|X| \ge a] \le \frac{1}{a^2}EX^2$, for a > 0.

(i) $P[|X| \ge \alpha] \le \frac{1}{\alpha^p}E[|X|^p]$.

(ii) $P[|X-EX| \ge \alpha] \le \frac{\sigma_X^2}{\alpha^2}$; that is, $P[|X-EX| \ge n\sigma_X] \le \frac{1}{n^2}$.

• Useful only when α > σ_X.

(iii) For a < b, $P[a \le X \le b] \ge 1 - \frac{4}{(b-a)^2}\left(\sigma_X^2 + \left(EX-\frac{a+b}{2}\right)^2\right)$.

(f) One-sided Chebyshev inequalities: If X ∈ L², then for a > 0,

(i) if EX = 0, $P[X \ge a] \le \frac{EX^2}{EX^2+a^2}$;

(ii) for general X,

i. $P[X \ge EX+a] \le \frac{\sigma_X^2}{\sigma_X^2+a^2}$; that is, $P[X \ge EX+n\sigma_X] \le \frac{1}{1+n^2}$;

ii. $P[X \le EX-a] \le \frac{\sigma_X^2}{\sigma_X^2+a^2}$; that is, $P[X \le EX-n\sigma_X] \le \frac{1}{1+n^2}$;

iii. $P[|X-EX| \ge a] \le \frac{2\sigma_X^2}{\sigma_X^2+a^2}$; that is, $P[|X-EX| \ge n\sigma_X] \le \frac{2}{1+n^2}$. This is a better bound than $\frac{\sigma_X^2}{a^2}$ iff σ_X > a.

(g) Chernoff bounds:

(i) $P[X \le b] \le \dfrac{E[e^{-\theta X}]}{e^{-\theta b}}$ for all θ > 0;

(ii) $P[X \ge b] \le \dfrac{E[e^{\theta X}]}{e^{\theta b}}$ for all θ > 0.

These can be optimized over θ; a numerical comparison of the bounds is sketched below.
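As a rough illustration of how the Markov, Chebyshev, and Chernoff bounds compare, the base-MATLAB sketch below evaluates them for the tail of an E(1) random variable (so EX = Var X = 1 and the exact tail is e^{−a}); the threshold a is an illustrative value.

    a = 10;                                       % tail threshold, X ~ E(1)
    exactTail = exp(-a);                          % P[X >= a] exactly
    markov    = 1/a;                              % Markov: E|X| / a
    cheby     = 1/(a-1)^2;                        % Chebyshev: P[X >= a] <= P[|X-1| >= a-1] <= Var X/(a-1)^2
    theta     = 0:0.01:0.99;
    chernoff  = min(exp(-theta*a)./(1-theta));    % Chernoff: min over theta of E[e^{theta X}] e^{-theta a}
    disp([markov cheby chernoff exactTail]);      % for this a the bounds get progressively tighter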

8.4. Suppose |X| ≤ M a.s. Then $P[|X| \ge a] \ge \frac{E|X|-a}{M-a}$ for all a ∈ [0, M).

8.5. If X ≥ 0 and E[X²] < ∞, then $P[X > 0] \ge \frac{(EX)^2}{E[X^2]}$.

Definition 8.6. If p and q are positive real numbers such that p + q = pq, or equivalently $\frac{1}{p}+\frac{1}{q} = 1$, then we call p and q a pair of conjugate exponents.

• 1 < p, q < ∞.

• As p → 1, q → ∞. Consequently, 1 and ∞ are also regarded as a pair of conjugate exponents.

8.7. Hölder's inequality: Let X ∈ L^p, Y ∈ L^q, p > 1, $\frac{1}{p}+\frac{1}{q} = 1$. Then

(a) XY ∈ L¹;


(b) $E[|XY|] \le (E[|X|^p])^{\frac{1}{p}}(E[|Y|^q])^{\frac{1}{q}}$, with equality if and only if $E[|Y|^q]\,|X(\omega)|^p = E[|X|^p]\,|Y(\omega)|^q$ a.s.

8.8. Cauchy–Bunyakovskii–Schwarz inequality: If X, Y ∈ L², then XY ∈ L¹ and
$$|E[XY]| \le E[|XY|] \le \left(E[|X|^2]\right)^{\frac{1}{2}}\left(E[|Y|^2]\right)^{\frac{1}{2}},$$
or equivalently $(E[XY])^2 \le E[X^2]\,E[Y^2]$, with equality if and only if $E[Y^2]X^2 = E[X^2]Y^2$ a.s.

(a) |Cov(X, Y)| ≤ σ_Xσ_Y.

(b) (EX)² ≤ EX².

(c) (P(A ∩ B))² ≤ P(A)P(B).

8.9. Minkowski's inequality: For p ≥ 1, X, Y ∈ L^p ⇒ X + Y ∈ L^p and
$$(E[|X+Y|^p])^{\frac{1}{p}} \le (E[|X|^p])^{\frac{1}{p}} + (E[|Y|^p])^{\frac{1}{p}}.$$

8.10. For p > q > 0:

(a) $E[|X|^q] \le 1 + E[|X|^p]$.

(b) Lyapounov's inequality: $(E[|X|^q])^{\frac{1}{q}} \le (E[|X|^p])^{\frac{1}{p}}$.

• $E[|X|] \le \sqrt{E[|X|^2]} \le \left(E[|X|^3]\right)^{\frac{1}{3}} \le \cdots$

• $E|X-Y| \le \sqrt{E[(X-Y)^2]}$. If X and Y are independent, the RHS is $\sqrt{\sigma_X^2+\sigma_Y^2+(EX-EY)^2}$; if X and Y are i.i.d., the RHS is $\sqrt{2}\,\sigma_X$.

8.11. Jensen's inequality: For a random variable X, if (1) X ∈ L¹ (and ϕ(X) ∈ L¹); (2) X ∈ (a, b) a.s.; and (3) ϕ is convex on (a, b), then ϕ(EX) ≤ E[ϕ(X)].

• For X > 0 (a.s.), $E\!\left(\frac{1}{X}\right) \ge \frac{1}{EX}$.

8.12.

• For p ∈ (0, 1], $E[|X+Y|^p] \le E[|X|^p] + E[|Y|^p]$.

• For p ≥ 1, $E[|X+Y|^p] \le 2^{p-1}\left(E[|X|^p] + E[|Y|^p]\right)$.

8.13 (Covariance inequality). Let X be any random variable and g and h any functions such that E[g(X)], E[h(X)], and E[g(X)h(X)] exist.

• If g and h are either both non-decreasing or both non-increasing, then
E[g(X)h(X)] ≥ E[g(X)]E[h(X)].
In particular, for non-decreasing g, E[g(X)(X − EX)] ≥ 0.

• If g is non-decreasing and h is non-increasing, then
E[g(X)h(X)] ≤ E[g(X)]E[h(X)].

See also (21), (22), and [2, p. 191–192].


9 Random Vectors

In this article, a vector is a column matrix of dimension n × 1 for some n ∈ N. We use 1 to denote a vector with all elements equal to 1. Note that 1(1^T) is a square matrix with all elements equal to 1. Finally, for any matrix A and constant a, we define the matrix A + a to be the matrix A with a added to each of its components. If A is a square matrix, then A + a = A + a1(1^T).

Definition 9.1. Suppose I is an index set. When the X_i's are random variables, we define a random vector X_I by X_I = (X_i : i ∈ I). For example, if I = [n], we have X_I = (X₁, X₂, . . . , X_n). Note also that X_{[n]} is usually denoted by X₁ⁿ. Sometimes we simply write X to denote X₁ⁿ.

• For disjoint A, B, X_{A∪B} = (X_A, X_B).

• For vectors x_I, y_I, we say x ≤ y if x_i ≤ y_i for all i ∈ I [7, p 206].

• When the dimension of X is implicit, we simply write X and x to represent X₁ⁿ and x₁ⁿ, respectively.

• For random vectors X, Y, we use (X, Y) to represent the random vector $\binom{X}{Y}$, or equivalently $[X^T\ Y^T]^T$.

Definition 9.2. A half-open cell or bounded rectangle in R^k is a set of the form $I_{a,b} = \{x : a_i < x_i \le b_i,\ \forall i \in [k]\} = \times_{i=1}^{k}(a_i, b_i]$. For a real function F on R^k, the difference of F around the vertices of I_{a,b} is
$$\Delta_{I_{a,b}}F = \sum_v \left(\operatorname{sgn}_{I_{a,b}}(v)\right)F(v) = \sum_v (-1)^{|\{i:\,v_i=a_i\}|}F(v), \tag{23}$$
where the sum extends over the 2^k vertices v of I_{a,b}. (The ith coordinate of the vertex v can be either a_i or b_i.) In particular, for k = 2, this is
$$F(b_1,b_2) - F(a_1,b_2) - F(b_1,a_2) + F(a_1,a_2).$$

9.3 (Joint cdf).
$$F_X(x) = F_{X_1^k}(x_1^k) = P[X_1 \le x_1, \ldots, X_k \le x_k] = P[X \in S_x] = P^X(S_x),$$
where $S_x = \{y : y_i \le x_i,\ i = 1, \ldots, k\}$ consists of the points "southwest" of x.

• $\Delta_{I_{a,b}}F_X \ge 0$.

• The set S_x is an orthant-like or semi-infinite corner with "northeast" vertex (vertex in the direction of the first orthant) specified by the point x [7, p 206].

C1 F_X is nondecreasing in each variable: if y_i ≥ x_i for all i, then F_X(y) ≥ F_X(x).

C2 F_X is continuous from above: $\lim_{h\downarrow 0} F_X(x_1+h, \ldots, x_k+h) = F_X(x)$.


C3 If x_i → −∞ for some i (the other coordinates held fixed), then F_X(x) → 0. If x_i → ∞ for all i, then F_X(x) → 1.

• $\lim_{h\downarrow 0} F_X(x_1-h, \ldots, x_k-h) = P^X(S_x^\circ)$, where $S_x^\circ = \{y : y_i < x_i,\ i = 1, \ldots, k\}$ is the interior of S_x.

• Given a ≤ b, $P[X \in I_{a,b}] = \Delta_{I_{a,b}}F_X$. This comes from (10) with A_i = [X_i ≤ a_i] and B = [∀i, X_i ≤ b_i]. Note that $P[\cap_{i\in I}A_i \cap B] = F(v)$ where v_i = a_i for i ∈ I and v_i = b_i otherwise.

• For any function F on R^k which satisfies (C1), (C2), and (C3), there is a unique probability measure µ on B_{R^k} such that for all a ≤ b in R^k we have µ(I_{a,b}) = Δ_{I_{a,b}}F (and µ(S_x) = F(x) for all x ∈ R^k).

• The following are equivalent:

(a) F_X is continuous at x;

(b) F_X is continuous from below;

(c) F_X(x) = P^X(S_x^\circ);

(d) P^X(S_x) = P^X(S_x^\circ);

(e) P^X(∂S_x) = 0, where $\partial S_x = S_x \setminus S_x^\circ = \{y : y_i \le x_i\ \forall i,\ \exists j\ y_j = x_j\}$.

• If k > 1, F_X can have discontinuity points even if P^X has no point masses.

• F_X can be discontinuous at uncountably many points.

• The continuity points of F_X are dense.

• For any j, we have $\lim_{x_j\to\infty} F_X(x) = F_{X_{I\setminus\{j\}}}(x_{I\setminus\{j\}})$.

9.4 (Joint pdf). A function f is a multivariate or joint pdf (commonly called a density) if and only if it satisfies the following two conditions:

(a) f ≥ 0;

(b) $\int f(x)\,dx = 1$.

• The integrability of the pdf f implies that for all i ∈ I, $\lim_{x_i\to\pm\infty} f(x_I) = 0$.

• $P[X \in A] = \int_A f_X(x)\,dx$.

• Remarks: Roughly, we may say the following:

(a) $f_X(x) = \lim_{\forall i,\ \Delta x_i\to 0}\dfrac{P[\forall i,\ x_i < X_i \le x_i+\Delta x_i]}{\prod_i \Delta x_i} = \lim_{\Delta x\to 0}\dfrac{P[x < X \le x+\Delta x]}{\prod_i \Delta x_i}$.


(b) For I = [n],
$$F_X(x) = \int_{-\infty}^{x_1}\!\!\cdots\int_{-\infty}^{x_n} f_X(x)\,dx_n\cdots dx_1, \qquad f_X(x) = \frac{\partial^n}{\partial x_1\cdots\partial x_n}F_X(x).$$
Also,
$$\frac{\partial}{\partial u}F_X(u,\ldots,u) = \sum_{k\in[n]}\int_{(-\infty,u]^{n-1}} f_X(v^{(k)})\,dx_{[n]\setminus\{k\}}, \quad\text{where } v^{(k)}_j = \begin{cases} u, & j = k,\\ x_j, & j \ne k. \end{cases}$$
For example, $\frac{\partial}{\partial u}F_{X,Y}(u,u) = \int_{-\infty}^{u} f_{X,Y}(x,u)\,dx + \int_{-\infty}^{u} f_{X,Y}(u,y)\,dy$.

(c) $f_{X_1^n}(x_1^n) = E\!\left[\prod_{i=1}^{n}\delta(X_i-x_i)\right]$.

• The level sets of a density are the sets where the density is constant.

9.5. Consider two random vectors X : Ω → R^{d₁} and Y : Ω → R^{d₂}. Define Z = (X, Y) : Ω → R^{d₁+d₂}. Suppose that Z has density f_{X,Y}(x, y).

(a) Marginal densities: $f_Y(y) = \int_{\mathbb{R}^{d_1}} f_{X,Y}(x,y)\,dx$ and $f_X(x) = \int_{\mathbb{R}^{d_2}} f_{X,Y}(x,y)\,dy$.

• In other words, to obtain the marginal densities, integrate out the unwanted variables.

• $f_{X_{I\setminus\{i\}}}(x_{I\setminus\{i\}}) = \int f_{X_I}(x_I)\,dx_i$.

(b) $f_{Y|X}(y|x) = \dfrac{f_{X,Y}(x,y)}{f_X(x)}$.

(c) $F_{Y|X}(y|x) = \int_{-\infty}^{y_1}\cdots\int_{-\infty}^{y_{d_2}} f_{Y|X}(t|x)\,dt_{d_2}\cdots dt_1$.

9.6. $P[(X+a_1, X+b_1)\cap(Y+a_2, Y+b_2) \ne \emptyset] = \int_A f_{X,Y}(x,y)\,dx\,dy$, where A is defined in (1.10).

9.7. Expectation and covariance:

(a) The expectation of a random vector X is defined to be the vector of expectations of its entries. EX is usually denoted by µ_X or m_X.

(b) For non-random matrices A, B, C and a random vector X, E[AXB + C] = A(EX)B + C.

(c) The correlation matrix R_X of a random vector X is defined by $R_X = E[XX^T]$. Note that it is symmetric.

(d) The covariance matrix C_X of a random vector X is defined as
$$C_X = \Lambda_X = \operatorname{Cov}[X] = E[(X-EX)(X-EX)^T] = E[XX^T] - (EX)(EX)^T = R_X - (EX)(EX)^T.$$

(i) The ij-entry of Cov[X] is simply Cov[X_i, X_j].


(ii) Λ_X is symmetric.

i. Properties of a symmetric matrix:

A. All eigenvalues are real.

B. Eigenvectors corresponding to different eigenvalues are not just linearly independent, but mutually orthogonal.

C. It is diagonalizable.

ii. Spectral theorem: The following equivalent statements hold for a symmetric matrix.

A. There exists a complete set of eigenvectors; that is, there exists an orthonormal basis u⁽¹⁾, . . . , u⁽ⁿ⁾ of Rⁿ with C_X u⁽ᵏ⁾ = λ_k u⁽ᵏ⁾.

B. C_X is diagonalizable by an orthogonal matrix U (UU^T = U^TU = I).

C. C_X can be represented as C_X = UΛU^T, where U is an orthogonal matrix whose columns are eigenvectors of C_X and Λ = diag(λ₁, . . . , λ_n) is a diagonal matrix with the eigenvalues of C_X.

(iii) C_X is always nonnegative definite (positive semidefinite). That is, for all a ∈ Rⁿ, where n is the dimension of X, $a^TC_Xa = E\!\left[(a^T(X-\mu_X))^2\right] \ge 0$.

• det(C_X) ≥ 0.

(iv) We can define $C_X^{1/2} = \sqrt{C_X} = U\sqrt{\Lambda}\,U^T$, where $\sqrt{\Lambda} = \operatorname{diag}(\sqrt{\lambda_1},\ldots,\sqrt{\lambda_n})$.

i. $\det\sqrt{C_X} = \sqrt{\det C_X}$.

ii. $\sqrt{C_X}$ is nonnegative definite.

iii. $\left(\sqrt{C_X}\right)^2 = \sqrt{C_X}\sqrt{C_X} = C_X$.

(v) Suppose, furthermore, that C_X is positive definite.

i. $C_X^{-1} = U\Lambda^{-1}U^T$, where $\Lambda^{-1} = \operatorname{diag}(\frac{1}{\lambda_1},\ldots,\frac{1}{\lambda_n})$.

ii. $C_X^{-1/2} = \sqrt{C_X^{-1}} = (\sqrt{C_X})^{-1} = UDU^T$, where $D = \operatorname{diag}\!\left(\frac{1}{\sqrt{\lambda_1}},\ldots,\frac{1}{\sqrt{\lambda_n}}\right)$.

iii. $\sqrt{C_X}\,C_X^{-1}\sqrt{C_X} = I$.

iv. $C_X^{-1}$, $C_X^{1/2}$, $C_X^{-1/2}$ are all positive definite (and hence all symmetric).

v. $\left(C_X^{-1/2}\right)^2 = C_X^{-1}$.

vi. Let $Y = C_X^{-1/2}(X-EX)$. Then EY = 0 and C_Y = I.

(vi) For i.i.d. X_i, each with variance σ², Λ_X = σ²I.

(e) Cov[AX + b] = A Cov[X] A^T.

• $\operatorname{Cov}[X^Th] = \operatorname{Cov}[h^TX] = h^T\operatorname{Cov}[X]\,h$, where h is a vector with the same dimension as X.

(f) For Y = X + Z, Λ_Y = Λ_X + 2Λ_{XZ} + Λ_Z.

• When X and Z are independent, Λ_Y = Λ_X + Λ_Z.


• For Y_i = X + Z_i, where X and Z are independent, Λ_Y = σ_X² + Λ_Z.

(g) Λ_{X+Y} + Λ_{X−Y} = 2Λ_X + 2Λ_Y.

(h) det(Λ_{X+Y}) ≤ 2ⁿ det(Λ_X + Λ_Y), where n is the dimension of X and Y.

(i) If Y = (X, X, . . . , X), where X is a random variable with variance σ_X², then
$$\Lambda_Y = \sigma_X^2\begin{pmatrix}1 & \cdots & 1\\ \vdots & \ddots & \vdots\\ 1 & \cdots & 1\end{pmatrix}.$$
Note that Y = 1X, where 1 has the same dimension as Y.

(j) Let X be a zero-mean random vector whose covariance matrix is singular. Then one of the X_i is a deterministic linear combination of the remaining components. In other words, there is a nonzero vector a such that a^TX = 0. In general, if Λ_X is singular, then there is a nonzero vector a such that a^TX = a^TEX.

(k) If X and Y are both random vectors (not necessarily of the same dimension), then their cross-covariance matrix is
$$\Lambda_{XY} = C_{XY} = \operatorname{Cov}[X,Y] = E[(X-EX)(Y-EY)^T].$$
Note that the ij-entry of C_{XY} is Cov[X_i, Y_j].

• $C_{YX} = (C_{XY})^T$.

(l) $R_{XY} = E[XY^T]$.

(m) If we stack X and Y into a composite vector $Z = \binom{X}{Y}$, then
$$C_Z = \begin{pmatrix} C_X & C_{XY}\\ C_{YX} & C_Y\end{pmatrix}.$$

(n) X and Y are said to be uncorrelated if C_{XY} = 0, the zero matrix. In this case,
$$C_{(X,Y)} = \begin{pmatrix} C_X & 0\\ 0 & C_Y\end{pmatrix},$$
a block diagonal matrix.

9.8. The joint characteristic function of an n-dimensional random vector X is defined by
$$\varphi_X(v) = E\!\left[e^{jv^TX}\right] = E\!\left[e^{j\sum_i v_iX_i}\right].$$
When X has a joint density f_X, ϕ_X is just the n-dimensional Fourier transform:
$$\varphi_X(v) = \int e^{jv^Tx}f_X(x)\,dx,$$
and the joint density can be recovered using the multivariate inverse Fourier transform:
$$f_X(x) = \frac{1}{(2\pi)^n}\int e^{-jv^Tx}\varphi_X(v)\,dv.$$


(a) $\varphi_X(u) = E\,e^{iu^TX}$.

(b) $f_X(x) = \frac{1}{(2\pi)^n}\int e^{-jv^Tx}\varphi_X(v)\,dv$.

(c) For Y = AX + b, $\varphi_Y(u) = e^{ib^Tu}\varphi_X(A^Tu)$.

(d) $\varphi_X(-u) = \overline{\varphi_X(u)}$.

(e) $\varphi_X(u) = \varphi_{X,Y}(u, 0)$.

(f) Moments:
$$\frac{\partial^{\sum_{i=1}^{n}\upsilon_i}}{\partial v_1^{\upsilon_1}\partial v_2^{\upsilon_2}\cdots\partial v_n^{\upsilon_n}}\varphi_X(0) = j^{\sum_{i=1}^{n}\upsilon_i}\,E\!\left(\prod_{i=1}^{n} X_i^{\upsilon_i}\right).$$

(i) $\frac{\partial}{\partial v_i}\varphi_X(0) = jEX_i$.

(ii) $\frac{\partial^2}{\partial v_i\partial v_j}\varphi_X(0) = j^2E[X_iX_j]$.

(g) Central moments:

(i) $\left.\frac{\partial}{\partial v_i}\ln(\varphi_X(v))\right|_{v=0} = jEX_i$.

(ii) $\left.\frac{\partial^2}{\partial v_i\partial v_j}\ln(\varphi_X(v))\right|_{v=0} = -\operatorname{Cov}[X_i,X_j]$.

(iii) $\left.\frac{\partial^3}{\partial v_i\partial v_j\partial v_k}\ln(\varphi_X(v))\right|_{v=0} = j^3E[(X_i-EX_i)(X_j-EX_j)(X_k-EX_k)]$.

(iv) $E[(X_i-EX_i)(X_j-EX_j)(X_k-EX_k)(X_\ell-EX_\ell)] = \Psi_{ijk\ell} + \Psi_{ij}\Psi_{k\ell} + \Psi_{ik}\Psi_{j\ell} + \Psi_{i\ell}\Psi_{jk}$, where $\Psi_{ijk\ell} = \left.\frac{\partial^4}{\partial v_i\partial v_j\partial v_k\partial v_\ell}\ln(\varphi_X(v))\right|_{v=0}$.

Remark: we do not require that any or all of i, j, k, and ℓ be distinct.

9.9 (Decorrelation and the Karhunen–Loève expansion). Let X be an n-dimensional random vector with zero mean and covariance matrix C. Then X has the representation X = PY, where the components of Y are uncorrelated and P is an n × n orthonormal matrix. This representation is called the Karhunen–Loève expansion.

• Y = P^TX.

• P^T = P^{-1} is called a decorrelating transformation.

• Diagonalize C = PDP^T, where D = diag(λ_i). Then Cov[Y] = D.

• In MATLAB, use [P,D] = eig(C). To extract the diagonal elements of D as a vector, use the command d = diag(D).

• If C is singular (equivalently, if some of the λ_i are zero), we only need to keep the Y_i for which λ_i > 0 and can throw away the other components of Y without any loss of information. This is because λ_i = EY_i², and EY_i² = 0 if and only if Y_i ≡ 0 a.s. [9, p 338–339].
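A compact MATLAB sketch of this decorrelating transformation, using an illustrative 2 × 2 covariance matrix and base-MATLAB functions only:

    C = [2 0.8; 0.8 1];                  % an illustrative covariance matrix
    [P, D] = eig(C);                     % columns of P are eigenvectors, D is diagonal
    N = 1e5;
    X = sqrtm(C)*randn(2, N);            % zero-mean samples with covariance approx C
    Y = P' * X;                          % decorrelated coordinates Y = P^T X
    disp(cov(Y'));                       % sample covariance approx D = diag(eigenvalues of C)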


9.1 Random Sequence

9.10. [11, p 9–10] Given a countable family X₁, X₂, . . . of r.v.'s, their statistical properties are regarded as defined by prescribing, for each integer n ≥ 1 and every finite set I ⊂ N, the joint distribution function F_{X_I} of the random vector X_I = (X_i : i ∈ I). Of course, some consistency requirements must be imposed upon the infinite family {F_{X_I}}, namely, that for j ∈ I

(a) $F_{X_{I\setminus\{j\}}}(x_{I\setminus\{j\}}) = \lim_{x_j\to\infty} F_{X_I}(x_I)$, and that

(b) the distribution function obtained from F_{X_I}(x_I) by interchanging two of the indices i₁, i₂ ∈ I and the corresponding variables x_{i₁} and x_{i₂} should be invariant. This simply means that the manner of labeling the random variables X₁, X₂, . . . is not relevant.

The joint distributions F_{X_I} are called the finite-dimensional distributions associated with $X_{\mathbb{N}} = (X_n)_{n=1}^{\infty}$.

10 Transform Methods

10.1 Probability Generating Function

Definition 10.1. [9][11, p. 11] Let X be a discrete random variable taking only nonnegative integer values. The probability generating function (pgf) of X is
$$G_X(z) = E[z^X] = \sum_{k=0}^{\infty} z^kP[X=k].$$

• In the summation, the first term (the k = 0 term) is P[X = 0] even when z = 0.

• G_X(0) = P[X = 0].

• G(z⁻¹) is the z-transform of the pmf.

• G_X(1) = 1.

• The name derives from the fact that it can be used to compute the pmf.

• It is finite at least for any complex z with |z| ≤ 1. Hence the pgf is well defined for |z| ≤ 1.

Definition 10.2. $G_X^{(k)}(1) = \lim_{z\uparrow 1} G_X^{(k)}(z)$.

10.3. Properties

(a) G_X is infinitely differentiable at least for |z| < 1.

(b) Probability generating property:
$$\left.\frac{1}{k!}\frac{d^{(k)}}{dz^{(k)}}G_X(z)\right|_{z=0} = P[X=k].$$


(c) Moment generating property:
$$\left.\frac{d^{(k)}}{dz^{(k)}}G_X(z)\right|_{z=1} = E\!\left[\prod_{i=0}^{k-1}(X-i)\right].$$
The RHS is called the kth factorial moment of X.

(d) In particular,
EX = G′_X(1),
EX² = G″_X(1) + G′_X(1),
Var X = G″_X(1) + G′_X(1) − (G′_X(1))².

(e) The pgf of a sum of independent random variables is the product of the individual pgfs. Let $S = \sum_{i=1}^{n} X_i$, where the X_i's are independent. Then
$$G_S(z) = \prod_{i=1}^{n} G_{X_i}(z).$$
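As a small numerical illustration of the moment-generating property (d), the base-MATLAB sketch below evaluates the pgf of a Poisson(λ) random variable by direct summation of the (truncated) pmf and checks that the one-sided derivative at z = 1 is approximately EX = λ; the truncation point and λ are illustrative.

    lambda = 3; k = 0:100;
    pmf = exp(-lambda + k*log(lambda) - gammaln(k+1));   % Poisson pmf, computed stably
    G  = @(z) sum(pmf .* z.^k);                          % pgf G_X(z) by direct summation
    dz = 1e-6;
    Gp = (G(1) - G(1-dz))/dz;                            % numerical derivative at z = 1 (from the left)
    disp([Gp, lambda]);                                  % G'(1) should be approximately EX = lambda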

10.2 Moment Generating Function

Definition 10.4. The moment generating function of a random variable X is definedas MX (s) = E

[esX]

=∫esxPX(dx) for all s for which this is finite.

10.5. Properties of the moment generating function

(a) MX (s) is defined on some interval containing 0. It is possible that the interval consistsof 0 alone.

(i) If X ≥ 0, this interval contains (−∞, 0].

(ii) If X ≤ 0, this interval contains [0,∞).

(b) Suppose that M (s) is defined throughout an interval (−s0, s0) where s0 > 0, i.e. itexists (is finite) in some neighborhood of 0. Then,

(i) X has finite moments of all order: E[|X|k

]<∞∀k ≥ 0

(ii) M (s) =∞∑k=0

sk

k!E[Xk], for complex-valued s with |s| < s0 [1, eqn (21.22) p 278].

Thus M (s) has a Taylor expansion about 0 with positive radius of convergence.

i. If M (s) can somehow be calculated and expanded in a series∞∑k=0

aksk, and if

the coefficients ak can be identified, then ak = 1k!

E[Xk]. That is E

[Xk]

=k!ak

(iii) M (k) (0) = E[Xk]

=∫xkPX (dx)

(c) If M is defined in some neighborhood of s, then M (k) (s) =∫xkesxPX (dx)

(d) See also Chernoff bound.


10.3 One-Sided Laplace Transform

10.6. The one-sided Laplace transform of a nonnegative random variable X is definedfor s ≥ 0 by L (s) = M (s) = E

[e−sX

]=∫

[0,∞)

e−sxPX (dx)

• Note that 0 is included in the range of integration.

• Always finite because e−sx ≤ 1. In fact, it is a decreasing function of s

• L (0) = 1

• L (s) ∈ [0, 1]

(a) Derivative : For s > 0, L(k) (s) = (−1)k∫xke−sxPX (dx) = (−1)k E

[Xke−sX

]

(i) lims↓0

dn

dsnL (s) = (−1)n E [Xn]

• Because the value at 0 can be ∞, it does not make sense to talk aboutdn

dsnL (0) for n > 1

(ii) ∀s ≥ 0 L (s) is differentiable and ddsL (s) = −E

[Xe−sX

], where at 0, this is the

right-derivative. ddsL (s) is finite for s > 0

• ddsL (0) = −E [X] ∈ [0,∞]

(b) Inversion formula : If FX is continuous at t > 0, then FX (t) = lims→∞

bstc∑k=0

(−1)k

k!skL(k) (s)

(c) FX and PX are determined by LX (s)

(i) In fact, they are determined by the values of LX (s) for s beyond any arbitrarys0. (That is we don’t need to know LX (s) for small s.) Also, knowing LX (s)on N is also sufficient.

(ii) Let µ and ν be probability measures on [0,∞). If ∃s0 ≥ 0 such that∫e−sxµ (dx) =∫

e−sxν (dx)∀s ≥ s0, then µ = ν

(iii) Let f1, f2 be real functions on [0,∞). If ∃s0 ≥ 0 such that ∀s ≥ s0

∫[0,∞)

e−sxf1 (x)dx =

∫[0,∞)

e−sxf2 (x)dx , then f1 = f2 Lebesgue-a.e.

(d) Let X1, . . . , Xn be independent nonnegative random variables, then L n∑i=1

Xi(s) =

n∏i=1

LXi (s)

(e) Suppose F is a distribution function with corresponding Laplace transform L. Then

(i)∫

[0,∞)

e−sxF (x)dx = 1λL (s)

(ii)∫

[0,∞)

e−sx (1− F (x))dx = 1λ

(1− L (s))

[17, p 183].


10.4 Characteristic Function

10.7. The characteristic function (abbreviated c.f. or ch.f.) of a probability measureµ on the line is defined for real t by ϕ (t) =

∫eitxµ (dx)

A random variable X has characteristic function ϕX (t) = E[eitX

]=∫eitxPX (dx)

(a) Always exists because |ϕ(t)| ≤∫|eitx|µ (dx) =

∫1µ (dx) = 1 <∞

(b) If X has a density, then ϕX (t) =∫eitxfX (x) dx

(c) ϕ (0) = 1

(d) ∀t ∈ R |ϕ (t)| ≤ 1

(e) ϕ is uniformly continuous.

(f) Suppose that all moments of X exists and ∀t ∈ R, EetX < ∞, then ϕX (t) =∞∑k=0

(it)k

k!EXk

(g) If E[|X|k

]<∞, then ϕ(k) (t) = ikE

[XkeitX

]and ϕ(k) (0) = ikE

[Xk]

.

(h) Riemann-Lebesgue theorem : If X has a density, then ϕX (t)→ 0 as |t| → ∞

(i) ϕaX+b (t) = eitbϕ (at)

(j) Conjugate Symmetry Property : ϕ−X (t) = ϕX (−t) = ϕX (t)

• X D=−X iff ϕX is real-valued.

• |ϕX | is even.

(k) X is a.s. integer-valued if and only if ϕX (2π) = 1

(l) If X1, X2, . . . , Xn are independent, then ϕ n∑j=1

Xj(t) =

n∏j=1

ϕXj (t)

(m) Inversion

(i) The inversion formula : If the probability measure µ has characteristic func-

tion ϕ and if µ a = µ b = 0, then µ (a, b] = limT→∞

12π

T∫−T

e−ita−e−itbit

ϕ (t)dt

i. In fact, if a < b, then limT→∞

12π

T∫−T

e−ita−e−itbit

ϕ (t)dt = µ (a, b) + 12µ a, b

ii. Equivalently, if F is the distribution function, and a, b are continuity points

of F , then F (b)− F (a) = limT→∞

12π

T∫−T

e−ita−e−itbit

ϕ (t)dt

(ii) Fourier inversion : Suppose that∫|ϕX (t)|dt <∞, then X is absolutely con-

tinuous with

91

Page 92: Introduction to Probability for Electrical Engineering · Introduction to Probability for Electrical Engineering Prapun Suksompong School of Electrical and Computer Engineering Cornell

i. bounded continuous density f (x) = 12π

∫e−itxϕ (t)dt

ii. µ (a, b] = 12π

∫e−ita−e−itb

itϕ (t)dt

(n) Continuity Theorem : XnD−→ X if and only if ∀t ϕXn (t)→ ϕX (t) (pointwise).

(o) ϕX on complex plane for X ≥ 0

(i) ϕX (z) is defined in the complex plane for Imz ≥ 0

i. |ϕX (z)| ≤ 1 for such z

(ii) In the domain Im z > 0, ϕX is analytic and continuous including the boundaryIm z = 0

(iii) ϕX determines uniquely a function LX (s) of real argument s ≥ 0 which isequal to LX (s) = ϕX (is) = Ee−sX . Conversely, LX (s) on the half-line s ≥ 0determines uniquely ϕX

(p) If $E|X|^n < \infty$, then
$$\left|\varphi_X(t) - \sum_{k=0}^{n} \frac{(it)^k}{k!}\, E X^k\right| \le E\left[\min\left\{\frac{|t|^{n+1}}{(n+1)!}|X|^{n+1},\; \frac{2|t|^n}{n!}|X|^n\right\}\right] \qquad (24)$$
and $\varphi_X(t) = \sum_{k=0}^{n} \frac{(it)^k}{k!}\, E X^k + t^n\beta(t)$ where $\lim_{|t|\to 0}\beta(t) = 0$, or equivalently,
$$\varphi_X(t) = \sum_{k=0}^{n} \frac{(it)^k}{k!}\, E X^k + o(t^n). \qquad (25)$$

(i) $|\varphi_X(t) - 1| \le E\big[\min\{|tX|, 2\}\big]$.

• If $EX = 0$, then $|\varphi_X(t) - 1| \le E\left[\min\left\{\frac{t^2}{2}X^2,\; 2|t||X|\right\}\right] \le \frac{t^2}{2}EX^2$.

(ii) For integrable $X$, $|\varphi_X(t) - 1 - itEX| \le E\left[\min\left\{\frac{t^2}{2}X^2,\; 2|t||X|\right\}\right] \le \frac{t^2}{2}EX^2$.

(iii) For $X$ with finite $EX^2$,

i. $\left|\varphi_X(t) - 1 - itEX + \frac{1}{2}t^2EX^2\right| \le E\left[\min\left\{\frac{|t|^3}{6}|X|^3,\; t^2|X|^2\right\}\right]$;

ii. $\varphi_X(t) = 1 + itEX - \frac{1}{2}t^2EX^2 + t^2\beta(t)$ where $\lim_{|t|\to 0}\beta(t) = 0$.

10.8. φX(u) = MX(ju).

10.9. If $X$ is a continuous r.v. with density $f_X$, then $\varphi_X(t) = \int e^{jtx} f_X(x)\,dx$ (Fourier transform) and $f_X(x) = \frac{1}{2\pi}\int e^{-jtx}\varphi_X(t)\,dt$ (Fourier inversion formula).

• $\varphi_X(t)$ is the Fourier transform of $f_X$ evaluated at $-t$.

• $\varphi_X$ inherits the properties of a Fourier transform.

(a) For nonnegative $a_i$ such that $\sum_i a_i = 1$: if $f_Y = \sum_i a_i f_{X_i}$, then $\varphi_Y = \sum_i a_i\varphi_{X_i}$.


(b) If fX is even, then ϕX is also even.

• If fX is even, ϕX = ϕ−X .
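To make the Fourier-inversion statement of 10.9 concrete, here is a minimal sketch (the grid limits and spacing are illustrative assumptions) that recovers the standard normal density from its characteristic function $\varphi(t) = e^{-t^2/2}$ by numerical integration of $\frac{1}{2\pi}\int e^{-jtx}\varphi(t)\,dt$.

```python
import numpy as np
from scipy.stats import norm

t = np.linspace(-30, 30, 4001)          # truncated integration grid (assumption)
phi = np.exp(-t**2 / 2)                 # ch.f. of N(0,1)

for x in [0.0, 1.0, 2.0]:
    integrand = np.exp(-1j * t * x) * phi
    f_x = np.trapz(integrand, t).real / (2 * np.pi)   # Fourier inversion formula
    print(f"x={x}: inverted {f_x:.5f}  vs  exact {norm.pdf(x):.5f}")
```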

10.10. Linear combination of independent random variables: Suppose $X_1, \ldots, X_n$ are independent and let $Y = \sum_{i=1}^{n} a_i X_i$. Then $\varphi_Y(t) = \prod_{i=1}^{n}\varphi_{X_i}(a_i t)$. Furthermore, if $|a_i| = 1$ and all $f_{X_i}$ are even, then $\varphi_Y(t) = \prod_{i=1}^{n}\varphi_{X_i}(t)$.

(a) ϕX+Y (t) = ϕX(t)ϕY (t).

(b) ϕX−Y (t) = ϕX(t)ϕY (−t).

(c) If fY is even, ϕX−Y = ϕX+Y = ϕXϕY .

10.11. Characteristic function of a mixture (convex combination) of distributions: Consider nonnegative $a_i$ such that $\sum_i a_i = 1$. Let $P_i$ be probability measures with corresponding ch.f. $\varphi_i$. Then the ch.f. of $\sum_i a_i P_i$ is $\sum_i a_i\varphi_i$.

(a) Discrete r.v.: Suppose the $p_i$ are pmfs with corresponding ch.f. $\varphi_i$. Then the ch.f. of $\sum_i a_i p_i$ is $\sum_i a_i\varphi_i$.

(b) Absolutely continuous r.v.: Suppose the $f_i$ are pdfs with corresponding ch.f. $\varphi_i$. Then the ch.f. of $\sum_i a_i f_i$ is $\sum_i a_i\varphi_i$.

11 Functions of random variables

Definition 11.1. The preimage or inverse image of a set $B$ is defined by $g^{-1}(B) = \{x : g(x) \in B\}$.

11.2. For discrete $X$, suppose $Y = g(X)$. Then $p_Y(y) = \sum_{x \in g^{-1}(\{y\})} p_X(x)$. The joint pmf of $Y$ and $X$ is given by $p_{X,Y}(x, y) = p_X(x)\,1[y = g(x)]$.

• In most cases, we can show that $X$ and $Y$ are not independent: pick a point $x^{(0)}$ such that $p_X(x^{(0)}) > 0$, then pick a point $y^{(1)}$ such that $y^{(1)} \ne g(x^{(0)})$ and $p_Y(y^{(1)}) > 0$. Then $p_{X,Y}(x^{(0)}, y^{(1)}) = 0$ but $p_X(x^{(0)})\,p_Y(y^{(1)}) > 0$. Note that this technique does not always work. For example, if $g$ is a constant function which maps all values of $x$ to a constant $c$, then we will not be able to find such a $y^{(1)}$. Of course, this is to be expected because we know that a constant is always independent of other random variables.

11.1 SISO case

There are many techniques for finding the cdf and pdf of Y = g(X).

(a) One may first find $F_Y(y) = P[g(X) \le y]$ and then obtain $f_Y$ from $F_Y'(y)$. In this case, Leibniz's rule in (52) will be useful.

(b) Formula (27) below provides a convenient way of arriving at $f_Y$ from $f_X$ without going through $F_Y$.


11.3 (Linear transformation). $Y = aX + b$ where $a \ne 0$.
$$F_Y(y) = \begin{cases} F_X\!\left(\frac{y-b}{a}\right), & a > 0,\\[4pt] 1 - F_X\!\left(\left(\frac{y-b}{a}\right)^{-}\right), & a < 0. \end{cases}$$

(a) Suppose $X$ is absolutely continuous. Then
$$f_Y(y) = \frac{1}{|a|}\, f_X\!\left(\frac{y-b}{a}\right). \qquad (26)$$
In fact, (26) holds even for a mixed r.v. if we allow delta functions, because $\frac{1}{|a|}\delta\!\left(\frac{y-b}{a} - x_k\right) = \delta\big(y - (a x_k + b)\big)$.

(b) Suppose $X$ is discrete. Then
$$p_Y(y) = p_X\!\left(\frac{y-b}{a}\right).$$
If we write $f_X(x) = \sum_k p_X(x_k)\,\delta(x - x_k)$, we have
$$f_Y(y) = \sum_k p_X(x_k)\,\delta\big(y - (a x_k + b)\big).$$
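A small simulation sketch of (26): for $X$ standard normal and illustrative constants $a, b$ (assumptions, not from the text), a histogram of $Y = aX + b$ should match $\frac{1}{|a|} f_X\!\left(\frac{y-b}{a}\right)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
a, b = -2.0, 3.0                       # illustrative constants (assumptions)
X = rng.standard_normal(100_000)
Y = a * X + b

ys = np.array([-1.0, 3.0, 6.0])
formula = norm.pdf((ys - b) / a) / abs(a)          # formula (26)
hist, edges = np.histogram(Y, bins=200, density=True)
centers = (edges[:-1] + edges[1:]) / 2
print("formula  :", np.round(formula, 4))
print("histogram:", np.round(np.interp(ys, centers, hist), 4))
```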

11.4 (Power law function). $Y = X^n$, $n \in \mathbb{N}$ or $n \in (0,\infty)$.

(a) $n$ odd: $F_Y(y) = F_X\!\left(y^{1/n}\right)$ and $f_Y(y) = \frac{1}{n} y^{\frac{1}{n}-1} f_X\!\left(y^{1/n}\right)$.

(b) $n$ even:
$$F_Y(y) = \begin{cases} F_X\!\left(y^{1/n}\right) - F_X\!\left(\left(-y^{1/n}\right)^{-}\right), & y \ge 0,\\ 0, & y < 0, \end{cases}$$
and
$$f_Y(y) = \begin{cases} \frac{1}{n} y^{\frac{1}{n}-1}\left(f_X\!\left(y^{1/n}\right) + f_X\!\left(-y^{1/n}\right)\right), & y \ge 0,\\ 0, & y < 0. \end{cases}$$

Again, the density $f_Y$ in the above formulas holds when $X$ is absolutely continuous. Note that when $n < 1$, $f_Y$ is not defined at 0. If we allow delta functions, then the density formulas above are also valid for mixed r.v. because $\frac{1}{n} y^{\frac{1}{n}-1}\delta\!\left(\pm y^{1/n} - x_k\right) = \delta\big(y - (\pm x_k)^n\big)$.

• Let $X$ be an absolutely continuous random variable. The density of $Y = X^2$ is
$$f_Y(y) = \begin{cases} 0, & y < 0,\\ \frac{1}{2\sqrt{y}} f_X(\sqrt{y}) + \frac{1}{2\sqrt{y}} f_X(-\sqrt{y}), & y \ge 0. \end{cases}$$

11.5. In general, for $Y = g(X)$, we solve the equation $y = g(x)$, denoting its real roots by $x_k$. Then
$$f_Y(y) = \sum_k \frac{f_X(x_k)}{|g'(x_k)|}. \qquad (27)$$
If $g(x) = c$ (a constant) for every $x$ in the interval $(a, b)$, then $F_Y(y)$ is discontinuous at $y = c$. Hence, $f_Y(y)$ contains an impulse $\big(F_X(b) - F_X(a)\big)\,\delta(y - c)$ [15, p. 93–94].


• To see this, consider the case where there is a unique $x$ such that $g(x) = y$. For small $\Delta x$ and $\Delta y$, $P[y < Y \le y + \Delta y] = P[x < X \le x + \Delta x]$, where $(y, y + \Delta y] = g\big((x, x + \Delta x]\big)$ is the image of the interval $(x, x + \Delta x]$. (Equivalently, $(x, x + \Delta x]$ is the inverse image of $(y, y + \Delta y]$.) This gives $f_Y(y)\,\Delta y = f_X(x)\,\Delta x$.

• The joint density $f_{X,Y}$ is
$$f_{X,Y}(x, y) = f_X(x)\,\delta\big(y - g(x)\big). \qquad (28)$$
Let the $x_k$ be the solutions for $x$ of $g(x) = y$. Then, by integrating (28) w.r.t. $x$, we obtain (27) via the use of (4).

• When $g$ is bijective,
$$f_Y(y) = \left|\frac{d}{dy} g^{-1}(y)\right| f_X\big(g^{-1}(y)\big).$$

• For $Y = a/X$, $f_Y(y) = \left|\frac{a}{y^2}\right| f_X\!\left(\frac{a}{y}\right)$.

• Suppose $X$ is nonnegative. For $Y = \sqrt{X}$, $f_Y(y) = 2y\, f_X(y^2)$, $y \ge 0$.

11.6. Given $Y = g(X)$ where $X \sim U(a, b)$: to get $f_Y(y_0)$, plot $g$ on $(a, b)$. Let $A = g^{-1}(\{y_0\})$ be the set of all points $x$ such that $g(x) = y_0$. Suppose $A$ can be written as a countable disjoint union $A = B \cup \bigcup_i I_i$, where $B$ is countable and the $I_i$'s are intervals. Then, at $y = y_0$,
$$f_Y(y) = \frac{1}{b-a}\sum_{x \in B}\frac{1}{|g'(x)|} + \frac{1}{b-a}\left(\sum_i \ell(I_i)\right)\delta(y - y_0),$$
where $\ell(I)$ is the length of the interval $I$.

• Suppose $\Theta$ is uniform on an interval of length $2\pi$. Then $Y_1 = \cos\Theta$ and $Y_2 = \sin\Theta$ are both arcsine random variables with $F_{Y_i}(y) = 1 - \frac{1}{\pi}\cos^{-1} y = \frac{1}{2} + \frac{1}{\pi}\sin^{-1} y$ and $f_{Y_i}(y) = \frac{1}{\pi}\frac{1}{\sqrt{1-y^2}}$ for $y \in [-1, 1]$. Note also that $E[Y_1 Y_2] = E Y_i = \operatorname{Cov}[Y_1, Y_2] = 0$. Hence, $Y_1$ and $Y_2$ are uncorrelated. However, it is easy to see that $Y_1$ and $Y_2$ are not independent by considering the joint and marginal densities at $y_1 = y_2 = 0$; a small simulation sketch follows below.

Example 11.7. $Y = X^2\,1_{[X \ge 0]}$.

(a) $F_Y(y) = \begin{cases} 0, & y < 0,\\ F_X(0), & y = 0,\\ F_X(\sqrt{y}), & y > 0. \end{cases}$

(b) $f_Y(y) = \begin{cases} 0, & y < 0,\\ \frac{1}{2\sqrt{y}} f_X(\sqrt{y}), & y > 0, \end{cases}\;+\; F_X(0)\,\delta(y)$.


Figure 20: Relationships among univariate distributions [22] (diagram after Leemis, 1986). The graphical content of the diagram is not reproduced here.


Figure 21: Another diagram demonstrating relationships among univariate distributions (Exponential, Normal/Gaussian, Rayleigh, Gamma, Beta, Chi-squared, Uniform). The graphical content of the diagram is not reproduced here.


11.2 MISO case

11.8. Suppose $X$ and $Y$ are jointly continuous random variables with joint density $f_{X,Y}$. The following two methods give the density of $Z = g(X, Y)$.

• Condition on one of the variables, say $Y = y$. Then, once conditioned, $Z$ is simply a function of one variable, $g(X, y)$; hence, we can use the one-variable technique to find $f_{Z|Y}(z|y)$. Finally, $f_Z(z) = \int f_{Z|Y}(z|y)\, f_Y(y)\,dy$.

• Directly find the joint density of the random vector $\binom{Z}{Y} = \binom{g(X,Y)}{Y}$. Observe that the Jacobian is $\begin{pmatrix} \frac{\partial g}{\partial x} & \frac{\partial g}{\partial y}\\ 0 & 1 \end{pmatrix}$. Hence, the magnitude of the determinant is $\left|\frac{\partial g}{\partial x}\right|$.

Of course, the standard way of finding the pdf of $Z$ is by differentiating the cdf $F_Z(z) = \int_{\{(x,y)\,:\,g(x,y) \le z\}} f_{X,Y}(x, y)\,d(x, y)$. This is still good for solving specific examples. It is also a good starting point for those who have not yet learned conditional probability or Jacobians.

Let the $x^{(k)}$ be the solutions of $g(x, y) = z$ for fixed $z$ and $y$. The first method gives
$$f_{Z|Y}(z|y) = \sum_k \left.\frac{f_{X|Y}(x|y)}{\left|\frac{\partial}{\partial x} g(x, y)\right|}\right|_{x = x^{(k)}}.$$
Hence,
$$f_{Z,Y}(z, y) = \sum_k \left.\frac{f_{X,Y}(x, y)}{\left|\frac{\partial}{\partial x} g(x, y)\right|}\right|_{x = x^{(k)}},$$
which comes out of the second method directly. Both methods then give
$$f_Z(z) = \int \sum_k \left.\frac{f_{X,Y}(x, y)}{\left|\frac{\partial}{\partial x} g(x, y)\right|}\right|_{x = x^{(k)}} dy.$$
The integration for a given $z$ is only over those values of $y$ for which there is at least one solution for $x$ in $z = g(x, y)$. If there is no such solution, $f_Z(z) = 0$. The same technique works for a function of more than one random variable, $Z = g(X_1, \ldots, X_n)$. For any $j \in [n]$, let the $x_j^{(k)}$ be the solutions for $x_j$ in $z = g(x_1, \ldots, x_n)$. Then
$$f_Z(z) = \int \sum_k \left.\frac{f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}{\left|\frac{\partial}{\partial x_j} g(x_1, \ldots, x_n)\right|}\right|_{x_j = x_j^{(k)}} dx_{[n]\setminus j}.$$

For the second method, we consider the random vector $\big(h_r(X_1, \ldots, X_n),\ r \in [n]\big)$, where $h_r(X_1, \ldots, X_n) = X_r$ for $r \ne j$ and $h_j = g$. The Jacobian is of the form
$$\begin{pmatrix} 1 & 0 & \cdots & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0 & \cdots & 0\\ \frac{\partial g}{\partial x_1} & \frac{\partial g}{\partial x_2} & \cdots & \frac{\partial g}{\partial x_j} & \cdots & \frac{\partial g}{\partial x_n}\\ 0 & 0 & \cdots & 0 & \cdots & 1 \end{pmatrix}.$$


By swapping the row containing all the partial derivatives to the first row, the magnitude of the determinant is unchanged, and we end up with an upper triangular matrix whose determinant is simply the product of the diagonal elements.

(a) For $Z = aX + bY$,
$$f_Z(z) = \int \frac{1}{|a|}\, f_{X,Y}\!\left(\frac{z - by}{a},\, y\right) dy = \int \frac{1}{|b|}\, f_{X,Y}\!\left(x,\, \frac{z - ax}{b}\right) dx.$$

• Note that $\operatorname{Jacobian}\!\left(\binom{ax + by}{y}, \binom{x}{y}\right) = \begin{pmatrix} a & b\\ 0 & 1 \end{pmatrix}$.

(i) When $a = 1$, $b = -1$,
$$f_{X-Y}(z) = \int f_{X,Y}(z + y, y)\,dy = \int f_{X,Y}(x, x - z)\,dx.$$

(ii) Note that when $X$ and $Y$ are independent and $a = b = 1$, we have the convolution formula
$$f_Z(z) = \int f_X(z - y)\, f_Y(y)\,dy = \int f_X(x)\, f_Y(z - x)\,dx.$$

(b) For $Z = XY$,
$$f_Z(z) = \int f_{X,Y}\!\left(x, \frac{z}{x}\right)\frac{1}{|x|}\,dx = \int f_{X,Y}\!\left(\frac{z}{y}, y\right)\frac{1}{|y|}\,dy$$
[9, Ex 7.2, 7.11, 7.15]. Note that $\operatorname{Jacobian}\!\left(\binom{xy}{y}, \binom{x}{y}\right) = \begin{pmatrix} y & x\\ 0 & 1 \end{pmatrix}$.

(c) For $Z = X^2 + Y^2$,
$$f_Z(z) = \int_{-\sqrt{z}}^{\sqrt{z}} \frac{f_{X|Y}\!\left(\sqrt{z - y^2}\,\middle|\, y\right) + f_{X|Y}\!\left(-\sqrt{z - y^2}\,\middle|\, y\right)}{2\sqrt{z - y^2}}\, f_Y(y)\,dy$$
[9, Ex 7.16]. Alternatively, applying (33), we have
$$f_Z(z) = \frac{1}{2}\int_0^{2\pi} f_{X,Y}(\sqrt{z}\cos\theta, \sqrt{z}\sin\theta)\,d\theta, \qquad z > 0 \qquad (29)$$
[16, eq (9.14), p. 318].

• This can be used to show that when $X, Y \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$, $Z = X^2 + Y^2 \sim \mathcal{E}\!\left(\frac{1}{2}\right)$.

(d) For $R = \sqrt{X^2 + Y^2}$, applying (32), we have
$$f_R(r) = r\int_0^{2\pi} f_{X,Y}(r\cos\theta, r\sin\theta)\,d\theta, \qquad r > 0 \qquad (30)$$
[16, eq (9.13), p. 318].


(e) For $Z = \frac{Y}{X}$,
$$f_Z(z) = \int \left|\frac{y}{z^2}\right| f_{X,Y}\!\left(\frac{y}{z}, y\right) dy = \int |x|\, f_{X,Y}(x, xz)\,dx.$$
Similarly, when $Z = \frac{X}{Y}$,
$$f_Z(z) = \int |y|\, f_{X,Y}(yz, y)\,dy.$$

(f) For $Z = \frac{\min(X, Y)}{\max(X, Y)}$, where $X$ and $Y$ are strictly positive,
$$F_Z(z) = \int_0^{\infty} F_{Y|X}(zx|x)\, f_X(x)\,dx + \int_0^{\infty} F_{X|Y}(zy|y)\, f_Y(y)\,dy,$$
$$f_Z(z) = \int_0^{\infty} x\, f_{Y|X}(zx|x)\, f_X(x)\,dx + \int_0^{\infty} y\, f_{X|Y}(zy|y)\, f_Y(y)\,dy, \qquad 0 < z < 1$$
[9, Ex 7.17].
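As a numerical illustration of item (c) and of the convolution formula in (a)(ii), the sketch below checks by simulation that $Z = X^2 + Y^2$ is exponential with rate $\frac{1}{2}$ when $X, Y$ are i.i.d. standard normal, and that the sum of two independent $\mathcal{E}(1)$ variables has density $s e^{-s}$ (a Gamma(2,1) law). Sample sizes are illustrative assumptions.

```python
import numpy as np
from scipy.stats import expon, gamma

rng = np.random.default_rng(3)
n = 200_000

# (c): X, Y i.i.d. N(0,1)  =>  Z = X^2 + Y^2 ~ Exponential with mean 2 (rate 1/2).
X, Y = rng.standard_normal(n), rng.standard_normal(n)
Z = X**2 + Y**2
print("P[Z > 3] simulated:", np.mean(Z > 3), " exact:", expon(scale=2).sf(3))

# (a)(ii): convolution of two independent E(1) variables gives f_S(s) = s e^{-s}.
S = rng.exponential(1.0, n) + rng.exponential(1.0, n)
print("P[S > 3] simulated:", np.mean(S > 3), " exact:", gamma(a=2, scale=1).sf(3))
```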

11.9 (Random sum). Let $S = \sum_{i=1}^{N} V_i$, where the $V_i$'s are i.i.d. $\sim V$, independent of $N$.

(a) $\varphi_S(u) = \varphi_N\big(-i\ln(\varphi_V(u))\big)$.

• $\varphi_S(u) = G_N\big(\varphi_V(u)\big)$.

• For nonnegative integer-valued summands, we have $G_S(z) = G_N\big(G_V(z)\big)$.

(b) $ES = EN \cdot EV$.

(c) $\operatorname{Var}[S] = EN\,(\operatorname{Var} V) + (EV)^2\,(\operatorname{Var} N)$.

Remark: If $N \sim \mathcal{P}(\lambda)$, then $\varphi_S(u) = \exp\big(\lambda(\varphi_V(u) - 1)\big)$, the compound Poisson distribution $\mathcal{CP}(\lambda, \mathcal{L}(V))$. Hence, the mean and variance of $\mathcal{CP}(\lambda, \mathcal{L}(V))$ are $\lambda EV$ and $\lambda EV^2$, respectively.
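The sketch below checks the random-sum moment formulas in 11.9(b)–(c) by simulation. The choice of $N \sim \mathcal{P}(\lambda)$ with exponential summands, and all parameter values, are illustrative assumptions; for this compound Poisson case the variance should equal $\lambda E V^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, mean_v = 3.0, 2.0                 # illustrative parameters (assumptions)
trials = 200_000

N = rng.poisson(lam, trials)
S = np.array([rng.exponential(mean_v, k).sum() for k in N])

EV, EV2 = mean_v, 2 * mean_v**2        # moments of an exponential with mean 2
print("E[S]   simulated:", S.mean(), "  formula EN*EV:", lam * EV)
print("Var[S] simulated:", S.var(),
      "  formula EN*VarV + (EV)^2*VarN:", lam * (EV2 - EV**2) + EV**2 * lam)
```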

11.3 MIMO case

Definition 11.10 (Jacobian). In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. Let $g$ be a function from a subset $D$ of $\mathbb{R}^n$ to $\mathbb{R}^m$. If $g$ is differentiable at $z \in D$, then all partial derivatives exist at $z$ and the Jacobian matrix of $g$ at a point $z \in D$ is
$$dg(z) = \begin{pmatrix} \frac{\partial g_1}{\partial x_1}(z) & \cdots & \frac{\partial g_1}{\partial x_n}(z)\\ \vdots & \ddots & \vdots\\ \frac{\partial g_m}{\partial x_1}(z) & \cdots & \frac{\partial g_m}{\partial x_n}(z) \end{pmatrix} = \left(\frac{\partial g}{\partial x_1}(z), \ldots, \frac{\partial g}{\partial x_n}(z)\right).$$
Alternative notations for the Jacobian matrix are $J$, $\frac{\partial(g_1, \ldots, g_m)}{\partial(x_1, \ldots, x_n)}$ [7, p 242], and $J_g(x)$, where it is assumed that the Jacobian matrix is evaluated at $z = x = (x_1, \ldots, x_n)$.


• Let $A$ be an $n$-dimensional "box" defined by the corners $x$ and $x + \Delta x$. The "volume" of the image $g(A)$ is $\left(\prod_i \Delta x_i\right)|\det dg(x)|$. Hence, the magnitude of the Jacobian determinant gives the ratio (scaling factor) of $n$-dimensional volumes (contents). In other words,
$$dy_1 \cdots dy_n = \left|\frac{\partial(y_1, \ldots, y_n)}{\partial(x_1, \ldots, x_n)}\right| dx_1 \cdots dx_n.$$

• $d(g^{-1}(y))$ is the Jacobian of the inverse transformation.

• In MATLAB, use jacobian.

See also (A.16).

11.11 (Jacobian formulas). Suppose $g$ is a vector-valued function of $x \in \mathbb{R}^n$, and $X$ is an $\mathbb{R}^n$-valued random vector. Define $Y = g(X)$. (Then $Y$ is also an $\mathbb{R}^n$-valued random vector.) If $X$ has joint density $f_X$, and $g$ is a suitable invertible mapping (such that the inverse mapping theorem is applicable), then
$$f_Y(y) = \frac{1}{\left|\det\big(dg(g^{-1}(y))\big)\right|}\, f_X\big(g^{-1}(y)\big) = \left|\det\big(d(g^{-1})(y)\big)\right| f_X\big(g^{-1}(y)\big).$$

• Note that for any matrix $A$, $\det(A) = \det(A^T)$. Hence, the formula above can tolerate a transposed (strictly speaking incorrect) "Jacobian".

In general, let $X = (X_1, X_2, \ldots, X_n)$ be a random vector with pdf $f_X(x)$. Let $S = \{x : f_X(x) > 0\}$. Consider a new random vector $Y = (Y_1, Y_2, \ldots, Y_n)$ defined by $Y_i = g_i(X)$. Suppose that $A_0, A_1, \ldots, A_r$ form a partition of $S$ with these properties: the set $A_0$, which may be empty, satisfies $P[X \in A_0] = 0$, and the transformation $Y = g(X) = (g_1(X), \ldots, g_n(X))$ is a one-to-one transformation from $A_k$ onto some common set $B$ for each $k \in [r]$. Then, for each $k$, the inverse function from $B$ to $A_k$ can be found. Denote the $k$th inverse $x = h^{(k)}(y)$ by $x_j = h^{(k)}_j(y)$. This $k$th inverse gives, for $y \in B$, the unique $x \in A_k$ such that $y = g(x)$. Assuming that the Jacobians $\det\big(dh^{(k)}(y)\big)$ do not vanish identically on $B$, we have
$$f_Y(y) = \sum_{k=1}^{r} f_X\big(h^{(k)}(y)\big)\left|\det\big(dh^{(k)}(y)\big)\right|, \qquad y \in B$$
[2, p. 185].

• Suppose that for some $k$, $Y_k$ is a function of the other $Y_i$'s; in particular, suppose $Y_k = h(Y_I)$ for some index set $I$ and some deterministic function $h$. Then the $k$th row of the Jacobian matrix is a linear combination of the other rows. In particular,
$$\frac{\partial y_k}{\partial x_j} = \sum_{i \in I}\left(\frac{\partial}{\partial y_i} h(y_I)\right)\frac{\partial y_i}{\partial x_j}.$$
Hence, the Jacobian determinant is 0.


11.12. Suppose $Y = g(X)$, where both $X$ and $Y$ have the same dimension. Then the joint density of $X$ and $Y$ is
$$f_{X,Y}(x, y) = f_X(x)\,\delta\big(y - g(x)\big).$$

• In most cases, we can show that $X$ and $Y$ are not independent: pick a point $x^{(0)}$ such that $f_X(x^{(0)}) > 0$, then pick a point $y^{(1)}$ such that $y^{(1)} \ne g(x^{(0)})$ and $f_Y(y^{(1)}) > 0$. Then $f_{X,Y}(x^{(0)}, y^{(1)}) = 0$ but $f_X(x^{(0)})\,f_Y(y^{(1)}) > 0$. Note that this technique does not always work. For example, if $g$ is a constant function which maps all values of $x$ to a constant $c$, then we will not be able to find such a $y^{(1)}$. Of course, this is to be expected because we know that a constant is always independent of other random variables.

Example 11.13.

(a) For $Y = AX + b$, where $A$ is a square, invertible matrix,
$$f_Y(y) = \frac{1}{|\det A|}\, f_X\big(A^{-1}(y - b)\big). \qquad (31)$$

(b) Transformation between Cartesian coordinates $(x, y)$ and polar coordinates $(r, \theta)$:

• $x = r\cos\theta$, $y = r\sin\theta$, $r = \sqrt{x^2 + y^2}$, $\theta = \tan^{-1}\!\left(\frac{y}{x}\right)$.

• $\begin{vmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \theta}\\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \theta} \end{vmatrix} = \begin{vmatrix} \cos\theta & -r\sin\theta\\ \sin\theta & r\cos\theta \end{vmatrix} = r$. (Recall that $dx\,dy = r\,dr\,d\theta$.)

We have
$$f_{R,\Theta}(r, \theta) = r\, f_{X,Y}(r\cos\theta, r\sin\theta), \qquad r \ge 0 \text{ and } \theta \in (-\pi, \pi), \qquad (32)$$
and
$$f_{X,Y}(x, y) = \frac{1}{\sqrt{x^2 + y^2}}\, f_{R,\Theta}\!\left(\sqrt{x^2 + y^2},\, \tan^{-1}\!\left(\frac{y}{x}\right)\right).$$

If, furthermore, $\Theta$ is uniform on $(0, 2\pi)$ and independent of $R$, then
$$f_{X,Y}(x, y) = \frac{1}{2\pi}\,\frac{1}{\sqrt{x^2 + y^2}}\, f_R\!\left(\sqrt{x^2 + y^2}\right).$$

(c) A related transformation is given by $Z = X^2 + Y^2$ and $\Theta = \tan^{-1}\!\left(\frac{Y}{X}\right)$. In this case, $X = \sqrt{Z}\cos\Theta$, $Y = \sqrt{Z}\sin\Theta$, and
$$f_{Z,\Theta}(z, \theta) = \frac{1}{2}\, f_{X,Y}\!\left(\sqrt{z}\cos\theta,\, \sqrt{z}\sin\theta\right), \qquad (33)$$
which gives (29).

11.14. Suppose $X, Y$ are i.i.d. $\mathcal{N}(0, \sigma^2)$. Then $R = \sqrt{X^2 + Y^2}$ and $\Theta = \arctan\frac{Y}{X}$ are independent, with $R$ being Rayleigh $\left(f_R(r) = \frac{r}{\sigma^2}\, e^{-\frac{1}{2}\left(\frac{r}{\sigma}\right)^2} U(r)\right)$ and $\Theta$ being uniform on $[0, 2\pi]$.


11.15 (Generation of a random sample of a normally distributed random variable). Let $U_1, U_2$ be i.i.d. $U(0, 1)$. Then the random variables
$$X_1 = \sqrt{-2\ln U_1}\,\cos(2\pi U_2), \qquad X_2 = \sqrt{-2\ln U_1}\,\sin(2\pi U_2)$$
are i.i.d. $\mathcal{N}(0, 1)$. Moreover,
$$Z_1 = \sqrt{-2\sigma^2\ln U_1}\,\cos(2\pi U_2), \qquad Z_2 = \sqrt{-2\sigma^2\ln U_1}\,\sin(2\pi U_2)$$
are i.i.d. $\mathcal{N}(0, \sigma^2)$.

• The idea is to generate $R$ and $\Theta$ according to (11.14) first; a simulation sketch is given below.

• $\det\big(dx(u)\big) = -\frac{2\pi}{u_1}$, where $u_1 = e^{-\frac{x_1^2 + x_2^2}{2}}$.
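A minimal sketch of the Box–Muller construction in 11.15 (generating the Rayleigh radius and uniform angle of (11.14) first), compared against the standard normal cdf at a few points; the sample size is an illustrative assumption.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 500_000
U1, U2 = rng.uniform(size=n), rng.uniform(size=n)

R = np.sqrt(-2 * np.log(U1))      # Rayleigh radius
Theta = 2 * np.pi * U2            # uniform angle
X1, X2 = R * np.cos(Theta), R * np.sin(Theta)

for x in [-1.0, 0.0, 1.5]:
    print(f"P[X1 <= {x}] simulated: {np.mean(X1 <= x):.4f}  exact: {norm.cdf(x):.4f}")
print("sample corr(X1, X2):", np.corrcoef(X1, X2)[0, 1])   # ~ 0; in fact independent
```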

11.16. In (11.11), suppose $\dim(Y) = \dim(g) \le \dim(X)$. To find the joint pdf of $Y$, we introduce an "arbitrary" $Z = h(X)$ so that $\dim\binom{Y}{Z} = \dim(X)$.

11.4 Order Statistics

Given a sample of $n$ random variables $X_1, \ldots, X_n$, reorder them so that $Y_1 \le Y_2 \le \cdots \le Y_n$. Then $Y_i$ is called the $i$th order statistic, sometimes also denoted $X_{i:n}$, $X_{\langle i\rangle}$, $X_{(i)}$, $X_{n:i}$, $X_{i,n}$, or $X_{(i)_n}$.

In particular

• Y1 = X1:n = Xmin is the first order statistic denoting the smallest of the Xi’s,

• Y2 = X2:n is the second order statistic denoting the second smallest of the Xi’s . . .,and

• Yn = Xn:n = Xmax is the nth order statistic denoting the largest of the Xi’s.

In words, the order statistics of a random sample are the sample values placed in ascendingorder [2, p 226]. Many results in this section can be found in [4].

11.17. Event properties:

$[X_{\min} \ge y] = \bigcap_i [X_i \ge y]$, and $[X_{\max} \ge y] = \bigcup_i [X_i \ge y]$.

$[X_{\min} > y] = \bigcap_i [X_i > y]$, and $[X_{\max} > y] = \bigcup_i [X_i > y]$.

$[X_{\min} \le y] = \bigcup_i [X_i \le y]$, and $[X_{\max} \le y] = \bigcap_i [X_i \le y]$.

$[X_{\min} < y] = \bigcup_i [X_i < y]$, and $[X_{\max} < y] = \bigcap_i [X_i < y]$.

Let $A_y = [X_{\max} \le y]$ and $B_y = [X_{\min} > y]$. Then $A_y = [\forall i\ X_i \le y]$ and $B_y = [\forall i\ X_i > y]$.


11.18 (Densities). Suppose the $X_i$ are absolutely continuous with joint density $f_X$. Let $S_y$ be the set of all $n!$ vectors obtained by permuting the coordinates of $y$. Then
$$f_Y(y) = \sum_{x \in S_y} f_X(x), \qquad y_1 \le y_2 \le \cdots \le y_n. \qquad (34)$$
To see this, note that $f_Y(y)\left(\prod_j \Delta y_j\right)$ is the probability that each $Y_j$ is in a small interval of length $\Delta y_j$ around $y_j$. This probability can be calculated by finding the probability that all of the $X_k$ fall into the corresponding small regions.

From the joint density, we can find the joint pdf/cdf of YI for any I ⊂ [n]. However, inmany cases, we can directly reapply the above technique to find the joint pdf of YI . Thisis especially useful when the Xi are independent or i.i.d.

(a) The marginal density $f_{Y_k}$ can be found by approximating $f_{Y_k}(y)\,\Delta y$ with
$$\sum_{j=1}^{n}\;\sum_{I \in \binom{[n]\setminus\{j\}}{k-1}} P\big[X_j \in [y, y + \Delta y) \text{ and } \forall i \in I,\ X_i \le y \text{ and } \forall r \in (I \cup \{j\})^c,\ X_r > y\big],$$
where for any set $A$ and integer $\ell \le |A|$ we define $\binom{A}{\ell}$ to be the set of all $\ell$-element subsets of $A$, and $(I \cup \{j\})^c = [n] \setminus (I \cup \{j\})$.

To see this, we first choose the $X_j$ that will be $Y_k$ with value around $y$. Then we must have $k - 1$ of the $X_i$ below $y$ and the rest of the $X_i$ above $y$.

(b) For integers $r < s$, the joint density $f_{Y_r,Y_s}(y_r, y_s)\,\Delta y_r\,\Delta y_s$ can be approximated by the probability that two of the $X_i$ are inside small regions around $y_r$ and $y_s$. To make them $Y_r$ and $Y_s$, of the other $X_i$, $r - 1$ must fall below $y_r$, $s - r - 1$ between $y_r$ and $y_s$, and $n - s$ beyond $y_s$.

• $f_{X_{\max},X_{\min}}(u, v)\,\Delta u\,\Delta v$ can be approximated by
$$\sum_{(j,k) \in S} P\big[X_j \in [u, u + \Delta u),\ X_k \in [v, v + \Delta v),\ \text{and } \forall i \in [n]\setminus\{j, k\},\ v < X_i \le u\big], \qquad v \le u,$$
where $S$ is the set of all $n(n-1)$ pairs $(j, k)$ from $[n] \times [n]$ with $j \ne k$. This simply chooses $j, k$ so that $X_j$ will be the maximum with value around $u$ and $X_k$ will be the minimum with value around $v$. Of course, the rest of the $X_i$ have to be between the min and the max.

When $n = 2$, we can use (34) to get
$$f_{X_{\max},X_{\min}}(u, v) = f_{X_1,X_2}(u, v) + f_{X_1,X_2}(v, u), \qquad v \le u.$$
Note that the joint density at a point $y_I$ is 0 if the elements of $y_I$ are not arranged in the "right" order.

11.19 (Distribution functions). We note again that the cdf may be obtained by integration of the densities in (11.18), as well as by direct arguments valid also in the discrete case.


(a) The marginal cdf is
$$F_{Y_k}(y) = \sum_{j=k}^{n}\;\sum_{I \in \binom{[n]}{j}} P\big[\forall i \in I,\ X_i \le y \text{ and } \forall r \in [n]\setminus I,\ X_r > y\big].$$
This is because the event $[Y_k \le y]$ is the same as the event that at least $k$ of the $X_i$ are $\le y$. In other words, let $N(a) = \sum_{i=1}^{n} 1[X_i \le a]$ be the number of $X_i$ which are $\le a$. Then
$$[Y_k \le y] = [N(y) \ge k] = \bigcup_{j \ge k} [N(y) = j], \qquad (35)$$
where the union is a disjoint union. Hence, we sum the probability that exactly $j$ of the $X_i$ are $\le y$ for $j \ge k$. Alternatively, note that the event $[Y_k \le y]$ can also be expressed as the disjoint union
$$\bigcup_{j \ge k}\big[X_j \le y \text{ and exactly } k - 1 \text{ of } X_1, \ldots, X_{j-1} \text{ are } \le y\big].$$
This gives
$$F_{Y_k}(y) = \sum_{j=k}^{n}\;\sum_{I \in \binom{[j-1]}{k-1}} P\big[X_j \le y,\ \forall i \in I,\ X_i \le y,\ \text{and } \forall r \in [j-1]\setminus I,\ X_r > y\big].$$

(b) For $r < s$: because $Y_r \le Y_s$, we have
$$[Y_r \le y_r] \cap [Y_s \le y_s] = [Y_s \le y_s], \qquad y_s \le y_r.$$
By (35), for $y_r < y_s$,
$$[Y_r \le y_r] \cap [Y_s \le y_s] = \left(\bigcup_{j=r}^{n} [N(y_r) = j]\right) \cap \left(\bigcup_{m=s}^{n} [N(y_s) = m]\right) = \bigcup_{m=s}^{n}\bigcup_{j=r}^{m}\big[N(y_r) = j \text{ and } N(y_s) = m\big],$$
where the upper limit of the second union is changed from $n$ to $m$ because we must have $N(y_r) \le N(y_s)$. Now, to have $N(y_r) = j$ and $N(y_s) = m$ for $m > j$ is to put $j$ of the $X_i$ in $(-\infty, y_r]$, $m - j$ of the $X_i$ in $(y_r, y_s]$, and $n - m$ of the $X_i$ in $(y_s, \infty)$.

(c) Alternatively, for $X_{\max}, X_{\min}$, we have
$$F_{X_{\max},X_{\min}}(u, v) = P(A_u \cap B_v^c) = P(A_u) - P(A_u \cap B_v) = P[\forall i\ X_i \le u] - P[\forall i\ v < X_i \le u],$$
where the second term is 0 when $u < v$. So
$$F_{X_{\max},X_{\min}}(u, v) = F_{X_1,\ldots,X_n}(u, \ldots, u)$$
when $u < v$. When $v \le u$, the second term can be found by (23), which gives
$$F_{X_{\max},X_{\min}}(u, v) = F_{X_1,\ldots,X_n}(u, \ldots, u) - \sum_{w \in S} (-1)^{|\{i:\, w_i = v\}|} F_{X_1,\ldots,X_n}(w) = \sum_{w \in S\setminus\{(u,\ldots,u)\}} (-1)^{|\{i:\, w_i = v\}|+1} F_{X_1,\ldots,X_n}(w),$$
where $S = \{u, v\}^n$ is the set of all $2^n$ vertices $w$ of the "box" $\times_{i\in[n]}(v, u]$. The joint density is 0 for $u < v$.

11.20. For independent $X_i$'s:

(a) $f_Y(y) = \sum_{x \in S_y} \prod_{i=1}^{n} f_{X_i}(x_i)$.

(b) Two forms of the marginal cdf:
$$F_{Y_k}(y) = \sum_{j=k}^{n}\sum_{I \in \binom{[n]}{j}} \left(\prod_{i \in I} F_{X_i}(y)\right)\prod_{r \in [n]\setminus I}\big(1 - F_{X_r}(y)\big) = \sum_{j=k}^{n}\sum_{I \in \binom{[j-1]}{k-1}} F_{X_j}(y)\left(\prod_{i \in I} F_{X_i}(y)\right)\prod_{r \in [j-1]\setminus I}\big(1 - F_{X_r}(y)\big).$$

(c) $F_{X_{\max},X_{\min}}(u, v) = \prod_{k \in [n]} F_{X_k}(u) - \begin{cases} 0, & u \le v,\\ \prod_{k \in [n]}\big(F_{X_k}(u) - F_{X_k}(v)\big), & v < u. \end{cases}$

(d) The marginal cdf is
$$F_{Y_k}(y) = \sum_{j=k}^{n}\sum_{I \in \binom{[n]}{j}} \left(\prod_{i \in I} F_{X_i}(y)\right)\prod_{r \in [n]\setminus I}\big(1 - F_{X_r}(y)\big).$$

(e) $F_{X_{\min}}(v) = 1 - \prod_i \big(1 - F_{X_i}(v)\big)$.

(f) $F_{X_{\max}}(u) = \prod_i F_{X_i}(u)$.

11.21. Suppose $X_i \stackrel{\text{i.i.d.}}{\sim} X$ with common density $f$ and distribution function $F$.

(a) The joint density is given by
$$f_Y(y) = n!\, f(y_1) f(y_2) \cdots f(y_n), \qquad y_1 \le y_2 \le \cdots \le y_n.$$
If we define $y_0 = -\infty$, $y_{k+1} = \infty$, $n_0 = 0$, $n_{k+1} = n + 1$, then for $k \in [n]$ and $1 \le n_1 < \cdots < n_k \le n$, the joint density $f_{Y_{n_1},Y_{n_2},\ldots,Y_{n_k}}(y_{n_1}, y_{n_2}, \ldots, y_{n_k})$ is given by
$$n!\left(\prod_{j=1}^{k} f(y_{n_j})\right)\prod_{j=0}^{k}\frac{\big(F(y_{n_{j+1}}) - F(y_{n_j})\big)^{n_{j+1}-n_j-1}}{(n_{j+1}-n_j-1)!}.$$
In particular, for $r < s$, the joint density $f_{Y_r,Y_s}(y_r, y_s)$ is given by
$$\frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!}\, f(y_r) f(y_s)\, F^{\,r-1}(y_r)\big(F(y_s) - F(y_r)\big)^{s-r-1}\big(1 - F(y_s)\big)^{n-s}$$
[2, Theorem 5.4.6 p 230].

(b) The joint cdf $F_{Y_r,Y_s}(y_r, y_s)$ is given by
$$\begin{cases} F_{Y_s}(y_s), & y_s \le y_r,\\[4pt] \displaystyle\sum_{m=s}^{n}\sum_{j=r}^{m}\frac{n!}{j!\,(m-j)!\,(n-m)!}\,\big(F(y_r)\big)^j\big(F(y_s) - F(y_r)\big)^{m-j}\big(1 - F(y_s)\big)^{n-m}, & y_r < y_s. \end{cases}$$

(c) $F_{X_{\max},X_{\min}}(u, v) = \big(F(u)\big)^n - \begin{cases} 0, & u \le v,\\ \big(F(u) - F(v)\big)^n, & v < u. \end{cases}$

(d) $f_{X_{\max},X_{\min}}(u, v) = \begin{cases} 0, & u \le v,\\ n(n-1)\, f_X(u)\, f_X(v)\,\big(F(u) - F(v)\big)^{n-2}, & v < u. \end{cases}$

(e) Marginal cdf:
$$F_{Y_k}(y) = \sum_{j=k}^{n}\binom{n}{j}\big(F(y)\big)^j\big(1 - F(y)\big)^{n-j} = \sum_{j=k}^{n}\binom{j-1}{k-1}\big(F(y)\big)^k\big(1 - F(y)\big)^{j-k} = \big(F(y)\big)^k\sum_{m=0}^{n-k}\binom{k+m-1}{k-1}\big(1 - F(y)\big)^m = \frac{n!}{(k-1)!\,(n-k)!}\int_0^{F(y)} t^{k-1}(1-t)^{n-k}\,dt.$$
Note that $N(y) \sim B\big(n, F(y)\big)$. The last equality comes from integrating the marginal density $f_{Y_k}$ in (36) with the change of variable $t = F(y)$.

(i) $F_{X_{\max}}(y) = \big(F(y)\big)^n$ and $f_{X_{\max}}(y) = n\big(F(y)\big)^{n-1} f_X(y)$.

(ii) $F_{X_{\min}}(y) = 1 - \big(1 - F(y)\big)^n$ and $f_{X_{\min}}(y) = n\big(1 - F(y)\big)^{n-1} f_X(y)$.

(f) Marginal density:
$$f_{Y_k}(y) = \frac{n!}{(k-1)!\,(n-k)!}\big(F(y)\big)^{k-1}\big(1 - F(y)\big)^{n-k} f_X(y) \qquad (36)$$
[2, Theorem 5.4.4 p 229].

Consider a small neighborhood $\Delta y$ around $y$. To have $Y_k \in \Delta y$, we must have exactly one of the $X_i$'s in $\Delta y$, exactly $k - 1$ of them less than $y$, and exactly $n - k$ of them greater than $y$. There are $n\binom{n-1}{k-1} = \frac{n!}{(k-1)!(n-k)!} = \frac{1}{B(k,\, n-k+1)}$ possible setups.


(g) The range is defined as $R = X_{\max} - X_{\min}$.

(i) For $x > 0$, $f_R(x) = n(n-1)\int\big(F(u) - F(u-x)\big)^{n-2} f(u-x) f(u)\,du$.

(ii) For $x \ge 0$, $F_R(x) = n\int\big(F(u) - F(u-x)\big)^{n-1} f(u)\,du$.

Both the pdf and the cdf above are derived by first finding the distribution of the range conditioned on the value of $X_{\max} = u$.

See also [4, Sec. 2.2] and [2, Sec. 5.4].
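To make (36) concrete: for $X_i \stackrel{\text{i.i.d.}}{\sim} U(0,1)$ it reduces to the Beta$(k, n-k+1)$ density. The sketch below compares a histogram of the $k$th order statistic with (36); the values of $n$, $k$, and the evaluation points are illustrative assumptions.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(6)
n, k, trials = 5, 2, 200_000                 # illustrative choices
samples = rng.uniform(size=(trials, n))
Yk = np.sort(samples, axis=1)[:, k - 1]      # k-th order statistic of each sample

ys = np.array([0.2, 0.4, 0.6])
coef = factorial(n) / (factorial(k - 1) * factorial(n - k))
formula = coef * ys**(k - 1) * (1 - ys)**(n - k)   # (36) with F(y)=y, f(y)=1
hist, edges = np.histogram(Yk, bins=100, range=(0, 1), density=True)
centers = (edges[:-1] + edges[1:]) / 2
print("formula  :", np.round(formula, 3))
print("histogram:", np.round(np.interp(ys, centers, hist), 3))
```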

11.22. Let $X_1, X_2, \ldots, X_n$ be a random sample from a discrete distribution with pmf $p_X(x_i) = p_i$, where $x_1 < x_2 < \cdots$ are the possible values of $X$ in ascending order. Define $P_0 = 0$ and $P_i = \sum_{k=1}^{i} p_k$. Then
$$P[Y_j \le x_i] = \sum_{k=j}^{n}\binom{n}{k} P_i^k (1 - P_i)^{n-k}$$
and
$$P[Y_j = x_i] = \sum_{k=j}^{n}\binom{n}{k}\left(P_i^k (1 - P_i)^{n-k} - P_{i-1}^k (1 - P_{i-1})^{n-k}\right).$$

Example 11.23. If $U_1, U_2, \ldots, U_k$ are independently uniformly distributed on the interval from 0 to $t_0$, then they have joint pdf
$$f_{U_1^k}(u_1^k) = \begin{cases} \frac{1}{t_0^k}, & 0 \le u_i \le t_0,\\ 0, & \text{otherwise}. \end{cases}$$
The order statistics $\tau_1, \tau_2, \ldots, \tau_k$ corresponding to $U_1, U_2, \ldots, U_k$ have joint pdf
$$f_{\tau_1^k}(t_1^k) = \begin{cases} \frac{k!}{t_0^k}, & 0 \le t_1 \le t_2 \le \cdots \le t_k \le t_0,\\ 0, & \text{otherwise}. \end{cases}$$

Example 11.24 ($n = 2$). Suppose $U = \max(X, Y)$ and $V = \min(X, Y)$, where $X$ and $Y$ have joint cdf $F_{X,Y}$.
$$F_{U,V}(u, v) = \begin{cases} F_{X,Y}(u, u), & u \le v,\\ F_{X,Y}(v, u) + F_{X,Y}(u, v) - F_{X,Y}(v, v), & u > v, \end{cases}$$
$$F_U(u) = F_{X,Y}(u, u), \qquad F_V(v) = F_X(v) + F_Y(v) - F_{X,Y}(v, v)$$
[9, Ex 7.5, 7.6]. The joint density is
$$f_{U,V}(u, v) = f_{X,Y}(u, v) + f_{X,Y}(v, u), \qquad v < u.$$


The marginal densities are given by
$$f_U(u) = \left.\frac{\partial}{\partial x}F_{X,Y}(x, y)\right|_{x=u,\,y=u} + \left.\frac{\partial}{\partial y}F_{X,Y}(x, y)\right|_{x=u,\,y=u} = \int_{-\infty}^{u} f_{X,Y}(x, u)\,dx + \int_{-\infty}^{u} f_{X,Y}(u, y)\,dy,$$
$$\begin{aligned} f_V(v) &= f_X(v) + f_Y(v) - f_U(v)\\ &= f_X(v) + f_Y(v) - \left(\left.\frac{\partial}{\partial x}F_{X,Y}(x, y)\right|_{x=v,\,y=v} + \left.\frac{\partial}{\partial y}F_{X,Y}(x, y)\right|_{x=v,\,y=v}\right)\\ &= f_X(v) + f_Y(v) - \left(\int_{-\infty}^{v} f_{X,Y}(x, v)\,dx + \int_{-\infty}^{v} f_{X,Y}(v, y)\,dy\right)\\ &= \int_{v}^{\infty} f_{X,Y}(v, y)\,dy + \int_{v}^{\infty} f_{X,Y}(x, v)\,dx. \end{aligned}$$

If, furthermore, $X$ and $Y$ are independent, then
$$F_{U,V}(u, v) = \begin{cases} F_X(u) F_Y(u), & u \le v,\\ F_X(v) F_Y(u) + F_X(u) F_Y(v) - F_X(v) F_Y(v), & u > v, \end{cases}$$
$$F_U(u) = F_X(u) F_Y(u), \qquad F_V(v) = F_X(v) + F_Y(v) - F_X(v) F_Y(v),$$
$$f_U(u) = f_X(u) F_Y(u) + F_X(u) f_Y(u), \qquad f_V(v) = f_X(v) + f_Y(v) - f_X(v) F_Y(v) - F_X(v) f_Y(v).$$

If, furthermore, $X$ and $Y$ are i.i.d., then
$$F_U(u) = F^2(u), \qquad F_V(v) = 2F(v) - F^2(v), \qquad f_U(u) = 2f(u)F(u), \qquad f_V(v) = 2f(v)\big(1 - F(v)\big).$$

11.25. Let the $X_i$ be i.i.d. with density $f$ and cdf $F$. The range is $R = X_{\max} - X_{\min}$.
$$F_R(x) = \begin{cases} 0, & x < 0,\\ n\int\big(F(v) - F(v-x)\big)^{n-1} f(v)\,dv, & x \ge 0, \end{cases}$$
$$f_R(x) = \begin{cases} 0, & x < 0,\\ n(n-1)\int\big(F(v) - F(v-x)\big)^{n-2} f(v-x) f(v)\,dv, & x > 0. \end{cases}$$
For example, when $X_i \stackrel{\text{i.i.d.}}{\sim} U(0, 1)$,
$$f_R(x) = n(n-1)\,x^{n-2}(1-x), \qquad 0 \le x \le 1$$
[16, Ex 9F p 322–323].


12 Convergences

Definition 12.1. A sequence of random variables $(X_n)$ converges pointwise to $X$ if $\forall\omega \in \Omega$, $\lim_{n\to\infty} X_n(\omega) = X(\omega)$.

Definition 12.2 (Strong convergence). The following statements are all equivalent conditions/notations for a sequence of random variables $(X_n)$ to converge almost surely to a random variable $X$:

(a) $X_n \xrightarrow{a.s.} X$

(i) $X_n \to X$ a.s.

(ii) $X_n \to X$ with probability 1.

(iii) $X_n \to X$ w.p. 1.

(iv) $\lim_{n\to\infty} X_n = X$ a.s.

(b) $(X_n - X) \xrightarrow{a.s.} 0$

(c) $P[X_n \to X] = 1$

(i) $P\big[\{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\}\big] = 1$

(ii) $P\big[\{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\}^c\big] = 0$

(iii) $P\big[\{\omega : \lim_{n\to\infty} X_n(\omega) \ne X(\omega)\}\big] = 0$

(iv) $P[X_n \nrightarrow X] = 0$

(v) $P\big[\{\omega : \lim_{n\to\infty} |X_n(\omega) - X(\omega)| = 0\}\big] = 1$

(d) $\forall\varepsilon > 0$, $P\big[\{\omega : \lim_{n\to\infty} |X_n(\omega) - X(\omega)| < \varepsilon\}\big] = 1$

12.3. Properties of convergence a.s.

(a) Uniqueness: if $X_n \xrightarrow{a.s.} X$ and $X_n \xrightarrow{a.s.} Y$, then $X = Y$ a.s.

(b) If $X_n \xrightarrow{a.s.} X$ and $Y_n \xrightarrow{a.s.} Y$, then

(i) $X_n + Y_n \xrightarrow{a.s.} X + Y$,

(ii) $X_n Y_n \xrightarrow{a.s.} XY$.

(c) If $g$ is continuous and $X_n \xrightarrow{a.s.} X$, then $g(X_n) \xrightarrow{a.s.} g(X)$.

(d) Suppose that $\forall\varepsilon > 0$, $\sum_{n=1}^{\infty} P[|X_n - X| > \varepsilon] < \infty$. Then $X_n \xrightarrow{a.s.} X$.

(e) Let $A_1, A_2, \ldots$ be independent. Then $1_{A_n} \xrightarrow{a.s.} 0$ if and only if $\sum_n P(A_n) < \infty$.


Definition 12.4 (Convergence in probability). The following statements are all equivalent conditions/notations for a sequence of random variables $(X_n)$ to converge in probability to a random variable $X$:

(a) $X_n \xrightarrow{P} X$

(i) $X_n \to_P X$

(ii) $\operatorname{p\,lim}_{n\to\infty} X_n = X$

(b) $(X_n - X) \xrightarrow{P} 0$

(c) $\forall\varepsilon > 0$, $\lim_{n\to\infty} P[|X_n - X| < \varepsilon] = 1$

(i) $\forall\varepsilon > 0$, $\lim_{n\to\infty} P\big(\{\omega : |X_n(\omega) - X(\omega)| > \varepsilon\}\big) = 0$

(ii) $\forall\varepsilon > 0\ \forall\delta > 0\ \exists N_\delta \in \mathbb{N}$ such that $\forall n \ge N_\delta$, $P[|X_n - X| > \varepsilon] < \delta$

(iii) $\forall\varepsilon > 0$, $\lim_{n\to\infty} P[|X_n - X| > \varepsilon] = 0$

(iv) The strict inequality between $|X_n - X|$ and $\varepsilon$ can be replaced by the corresponding non-strict version.

12.5. Properties of convergence in probability

(a) Uniqueness: if $X_n \xrightarrow{P} X$ and $X_n \xrightarrow{P} Y$, then $X = Y$ a.s.

(b) Suppose $X_n \xrightarrow{P} X$, $Y_n \xrightarrow{P} Y$, and $a_n \to a$. Then

(i) $X_n + Y_n \xrightarrow{P} X + Y$,

(ii) $a_n X_n \xrightarrow{P} aX$,

(iii) $X_n Y_n \xrightarrow{P} XY$.

(c) Suppose $(X_n)$ are i.i.d. with distribution $U[0, \theta]$. Let $Z_n = \max\{X_i : 1 \le i \le n\}$. Then $Z_n \xrightarrow{P} \theta$.

(d) If $g$ is continuous and $X_n \xrightarrow{P} X$, then $g(X_n) \xrightarrow{P} g(X)$.

(i) Suppose that $g : \mathbb{R}^d \to \mathbb{R}$ is continuous. Then $X_{i,n} \xrightarrow{P} X_i$ for every $i$ implies $g(X_{1,n}, \ldots, X_{d,n}) \xrightarrow{P} g(X_1, \ldots, X_d)$.

(e) Let $g$ be a continuous function at $c$. Then $X_n \xrightarrow{P} c \Rightarrow g(X_n) \xrightarrow{P} g(c)$.

(f) Fatou's lemma: $0 \le X_n \xrightarrow{P} X \Rightarrow \liminf_{n\to\infty} EX_n \ge EX$.

(g) Suppose $X_n \xrightarrow{P} X$ and $|X_n| \le Y$ with $EY < \infty$. Then $EX_n \to EX$.


(h) Let $A_1, A_2, \ldots$ be independent. Then $1_{A_n} \xrightarrow{P} 0$ iff $P(A_n) \to 0$.

(i) $X_n \xrightarrow{P} 0$ iff $\exists\delta > 0$ such that $\forall t \in [-\delta, \delta]$ we have $\varphi_{X_n}(t) \to 1$.

Definition 12.6 (Weak convergence for probability measures). Let $P_n$ and $P$ be probability measures on $\mathbb{R}^d$ ($d \ge 1$). The sequence $P_n$ converges weakly to $P$ if the sequence of real numbers $\int g\,dP_n \to \int g\,dP$ for every $g$ which is real-valued, continuous, and bounded on $\mathbb{R}^d$.

12.7. Let $(X_n)$, $X$ be $\mathbb{R}^d$-valued random variables with distribution functions $(F_n)$, $F$, distributions $(\mu_n)$, $\mu$, and ch.f. $(\varphi_n)$, $\varphi$, respectively.

The following are equivalent conditions for a sequence of random variables $(X_n)$ to converge in distribution to a random variable $X$:

(a) $(X_n)$ converges in distribution (or in law) to $X$.

(i) $X_n \Rightarrow X$

i. $X_n \xrightarrow{L} X$

ii. $X_n \xrightarrow{D} X$

(ii) $F_n \Rightarrow F$

i. $F_{X_n} \Rightarrow F_X$

ii. $F_n$ converges weakly to $F$

iii. $\lim_{n\to\infty} P^{X_n}(A) = P^X(A)$ for every $A$ of the form $A = (-\infty, x]$ for which $P^X\{x\} = 0$

(iii) $\mu_n \Rightarrow \mu$

i. $P^{X_n}$ converges weakly to $P^X$

(b) Skorohod's theorem: there exist random variables $Y_n$ and $Y$ on a common probability space $(\Omega, \mathcal{F}, P)$ such that $Y_n \stackrel{D}{=} X_n$, $Y \stackrel{D}{=} X$, and $Y_n \to Y$ on (the whole) $\Omega$.

(c) $\lim_{n\to\infty} F_n = F$ at all continuity points of $F$.

(i) $F_{X_n}(x) \to F_X(x)$ for all $x$ such that $P[X = x] = 0$.

(d) There exists a (countable) set $D$ dense in $\mathbb{R}$ such that $F_n(x) \to F(x)$ for all $x \in D$.

(e) Continuous mapping theorem: $\lim_{n\to\infty} E g(X_n) = E g(X)$ for all $g$ which are real-valued, continuous, and bounded on $\mathbb{R}^d$.

(i) $\lim_{n\to\infty} E g(X_n) = E g(X)$ for all bounded real-valued functions $g$ such that $P[X \in D_g] = 0$, where $D_g$ is the set of points of discontinuity of $g$.

(ii) $\lim_{n\to\infty} E g(X_n) = E g(X)$ for all bounded Lipschitz continuous functions $g$.

(iii) $\lim_{n\to\infty} E g(X_n) = E g(X)$ for all bounded uniformly continuous functions $g$.


(iv) $\lim_{n\to\infty} E g(X_n) = E g(X)$ for all complex-valued functions $g$ whose real and imaginary parts are bounded and continuous.

(f) Continuity theorem: $\varphi_{X_n} \to \varphi_X$.

(i) For nonnegative random variables: $\forall s \ge 0$, $L_{X_n}(s) \to L_X(s)$, where $L_X(s) = E e^{-sX}$.

Note that there is no requirement that $(X_n)$ and $X$ be defined on the same probability space $(\Omega, \mathcal{A}, P)$.

12.8. Continuity theorem: Suppose $\lim_{n\to\infty}\varphi_n(t)$ exists for every $t$; call this limit $\varphi_\infty$. Furthermore, suppose $\varphi_\infty$ is continuous at 0. Then there exists a probability distribution $\mu_\infty$ such that $\mu_n \Rightarrow \mu_\infty$, and $\varphi_\infty$ is the characteristic function of $\mu_\infty$.

12.9. Properties of convergence in distribution

(a) If $F_n \Rightarrow F$ and $F_n \Rightarrow G$, then $F = G$.

(b) Suppose $X_n \Rightarrow X$.

(i) If $P[X \text{ is a discontinuity point of } g] = 0$, then $g(X_n) \Rightarrow g(X)$.

(ii) $E g(X_n) \to E g(X)$ for every bounded real-valued function $g$ such that $P[X \in D_g] = 0$, where $D_g$ is the set of points of discontinuity of $g$.

i. $g(X_n) \Rightarrow g(X)$ for $g$ continuous.

(iii) If $Y_n \xrightarrow{P} 0$, then $X_n + Y_n \Rightarrow X$.

(iv) If $X_n - Y_n \Rightarrow 0$, then $Y_n \Rightarrow X$.

(c) If $X_n \Rightarrow a$ and $g$ is continuous at $a$, then $g(X_n) \Rightarrow g(a)$.

(d) Suppose $(\mu_n)$ is a sequence of probability measures on $\mathbb{R}$ that are all point masses with $\mu_n(\{\alpha_n\}) = 1$. Then $\mu_n$ converges weakly to a limit $\mu$ iff $\alpha_n \to \alpha$; and in this case $\mu$ is a point mass at $\alpha$.

(e) Scheffé's theorem:

(i) Suppose $P^{X_n}$ and $P^X$ have densities $\delta_n$ and $\delta$ w.r.t. the same measure $\mu$. Then $\delta_n \to \delta$ $\mu$-a.e. implies

i. $\forall B \in \mathcal{B}_{\mathbb{R}}$, $P^{X_n}(B) \to P^X(B)$;

• $F_{X_n} \to F_X$,

• $F_{X_n} \Rightarrow F_X$;

ii. if $g$ is bounded, then $\int g(x)\,P^{X_n}(dx) \to \int g(x)\,P^X(dx)$; equivalently, $E[g(X_n)] \to E[g(X)]$, where each expectation is defined with respect to the appropriate $P$.

(ii) Remarks:

i. For absolutely continuous random variables, $\mu$ is the Lebesgue measure and $\delta$ is the probability density function.

ii. For discrete random variables, $\mu$ is the counting measure and $\delta$ is the probability mass function.

(f) Normal r.v.

(i) Let $X_n \sim \mathcal{N}(\mu_n, \sigma_n^2)$. Suppose $\mu_n \to \mu \in \mathbb{R}$ and $\sigma_n^2 \to \sigma^2 \ge 0$. Then $X_n \Rightarrow \mathcal{N}(\mu, \sigma^2)$.

(ii) Suppose that the $X_n$ are normal random variables and let $X_n \Rightarrow X$. Then (1) the means and variances of the $X_n$ converge to some limits $m$ and $\sigma^2$, and (2) $X$ is normal with mean $m$ and variance $\sigma^2$.

(g) $X_n \Rightarrow 0$ if and only if $\exists\delta > 0$ such that $\forall t \in [-\delta, \delta]$ we have $\varphi_{X_n}(t) \to 1$.

12.10 (Convergence in distribution of products of random variables). Let $(X_{n,k})$ be a triangular array of random variables in $(0, c]$, where $c < 1$, and let $X$ be a nonnegative random variable. Assume
$$\sum_k X_{n,k}^2 \Rightarrow 0 \quad\text{or}\quad \sup_k X_{n,k} \Rightarrow 0 \quad\text{as } n\to\infty.$$
Then, as $n\to\infty$,
$$\prod_k (1 - X_{n,k}) \Rightarrow e^{-X} \quad\text{if and only if}\quad \sum_k X_{n,k} \Rightarrow X$$
[10]. See also A.4.

12.11. Relationship between convergences

(a) For a discrete probability space, $X_n \xrightarrow{a.s.} X$ if and only if $X_n \xrightarrow{P} X$.

(b) Suppose $X_n \xrightarrow{P} X$, $|X_n| \le Y$ for all $n$, and $Y \in L^p$. Then $X \in L^p$ and $X_n \xrightarrow{L^p} X$.

(c) Suppose $X_n \xrightarrow{a.s.} X$, $|X_n| \le Y$ for all $n$, and $Y \in L^p$. Then $X \in L^p$ and $X_n \xrightarrow{L^p} X$.

(d) $X_n \xrightarrow{P} X \Rightarrow X_n \xrightarrow{D} X$.

(e) If $X_n \xrightarrow{D} X$ and $\exists a \in \mathbb{R}$ such that $X = a$ a.s., then $X_n \xrightarrow{P} X$.

(i) Hence, when $X = a$ a.s., $X_n \xrightarrow{P} X$ and $X_n \xrightarrow{D} X$ are equivalent.

See also Figure 22.

Example 12.12.

(a) $\Omega = [0, 1]$, $P$ is uniform on $[0, 1]$.
$$X_n(\omega) = \begin{cases} 0, & \omega \in \left[0,\, 1 - \frac{1}{n^2}\right],\\ e^n, & \omega \in \left(1 - \frac{1}{n^2},\, 1\right]. \end{cases}$$


Figure 22: Relationship between convergences ($X_n \xrightarrow{L^p} X$, $X_n \xrightarrow{P} X$, $X_n \xrightarrow{a.s.} X$, $X_n \xrightarrow{D} X$, and the conditions linking them, as listed in (12.11)). The diagram itself is not reproduced here.

(i) $X_n \stackrel{L^p}{\nrightarrow} 0$

(ii) $X_n \xrightarrow{a.s.} 0$

(iii) $X_n \xrightarrow{P} 0$

(b) Let $\Omega = [0, 1]$ with the uniform probability distribution. Define $X_n(\omega) = \omega + 1_{\left[\frac{j-1}{k},\, \frac{j}{k}\right]}(\omega)$, where $k = \left\lceil\frac{1}{2}\left(\sqrt{1 + 8n} - 1\right)\right\rceil$ and $j = n - \frac{1}{2}k(k - 1)$. The sequence of intervals $\left[\frac{j-1}{k}, \frac{j}{k}\right]$ under the indicator function is shown in Figure 23.

Let $X(\omega) = \omega$.

(i) The sequence of real numbers $(X_n(\omega))$ does not converge for any $\omega$.

(ii) $X_n \xrightarrow{L^p} X$

(iii) $X_n \xrightarrow{P} X$

(c) $X_n = \begin{cases} \frac{1}{n}, & \text{w.p. } 1 - \frac{1}{n},\\ n, & \text{w.p. } \frac{1}{n}. \end{cases}$

(i) $X_n \xrightarrow{P} 0$

(ii) $X_n \stackrel{L^p}{\nrightarrow} 0$


Figure 23: Diagram for Example (12.12)(b): the sequence of indicator intervals $[0,1]$; then $[0,\tfrac12], [\tfrac12,1]$; then $[0,\tfrac13], [\tfrac13,\tfrac23], [\tfrac23,1]$; then $[0,\tfrac14], [\tfrac14,\tfrac12], [\tfrac12,\tfrac34], [\tfrac34,1]$; and so on. The diagram itself is not reproduced here.

12.1 Summation of random variables

Set $S_n = \sum_{i=1}^{n} X_i$.

12.13 (Markov's theorem; Chebyshev's inequality). For $(X_i)$ with finite variances, if $\frac{1}{n^2}\operatorname{Var} S_n \to 0$, then $\frac{1}{n}S_n - \frac{1}{n}E S_n \xrightarrow{P} 0$.

• If the $(X_i)$ are pairwise independent, then $\frac{1}{n^2}\operatorname{Var} S_n = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var} X_i$. See also (12.17).

12.2 Summation of independent random variables

12.14. For independent $X_n$, the probability that the series $\sum_i X_i$ converges is either 0 or 1.

12.15 (Kolmogorov's SLLN). Consider a sequence $(X_n)$ of independent random variables. If $\sum_n \frac{\operatorname{Var} X_n}{n^2} < \infty$, then $\frac{1}{n}S_n - \frac{1}{n}E S_n \xrightarrow{a.s.} 0$.

• In particular, for independent $X_n$, if $\frac{1}{n}\sum_{i=1}^{n} E X_i \to a$ or $E X_n \to a$, then $\frac{1}{n}S_n \xrightarrow{a.s.} a$.

12.16. Suppose that $X_1, X_2, \ldots$ are independent and that the series $X = X_1 + X_2 + \cdots$ converges a.s. Then $\varphi_X = \varphi_{X_1}\varphi_{X_2}\cdots$.

12.17. For pairwise independent $(X_i)$ with finite variances, if $\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var} X_i \to 0$, then $\frac{1}{n}S_n - \frac{1}{n}E S_n \xrightarrow{P} 0$.

(a) Chebyshev's theorem: For pairwise independent $(X_i)$ with uniformly bounded variances, $\frac{1}{n}S_n - \frac{1}{n}E S_n \xrightarrow{P} 0$.


12.3 Summation of i.i.d. random variables

Let $(X_i)$ be i.i.d. random variables.

12.18 (Chebyshev's inequality).
$$P\left[\left|\frac{1}{n}\sum_{i=1}^{n} X_i - EX_1\right| \ge \varepsilon\right] \le \frac{1}{\varepsilon^2}\,\frac{\operatorname{Var}[X_1]}{n}.$$

• The $X_i$'s do not have to be independent; they only need to be pairwise uncorrelated.
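The sketch below illustrates 12.18 numerically: for i.i.d. $U(0,1)$ samples (an arbitrary illustrative choice), it compares the simulated probability that the sample mean deviates from $EX_1 = \frac{1}{2}$ by at least $\varepsilon$ with the Chebyshev bound $\operatorname{Var}[X_1]/(n\varepsilon^2)$. The bound is valid but typically loose.

```python
import numpy as np

rng = np.random.default_rng(7)
n, eps, trials = 100, 0.05, 100_000     # illustrative choices (assumptions)
var_x1 = 1.0 / 12.0                     # Var of U(0,1)

means = rng.uniform(size=(trials, n)).mean(axis=1)
p_dev = np.mean(np.abs(means - 0.5) >= eps)
bound = var_x1 / (n * eps**2)
print(f"P[|mean - 1/2| >= {eps}] simulated: {p_dev:.4f}, Chebyshev bound: {bound:.4f}")
```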

12.19 (WLLN). Weak law of large numbers:

(a) $L^2$ weak law (finite variance): Let $(X_i)$ be uncorrelated (or pairwise independent) random variables.

(i) $\operatorname{Var} S_n = \sum_{i=1}^{n}\operatorname{Var} X_i$.

(ii) If $\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var} X_i \to 0$, then $\frac{1}{n}S_n - \frac{1}{n}E S_n \xrightarrow{P} 0$.

(iii) If $E X_i = \mu$ and $\operatorname{Var} X_i \le C < \infty$, then $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{L^2} \mu$.

(b) Let $(X_i)$ be i.i.d. random variables with $E X_i = \mu$ and $\operatorname{Var} X_i = \sigma^2 < \infty$. Then

(i) $P\left[\left|\frac{1}{n}\sum_{i=1}^{n} X_i - EX_1\right| \ge \varepsilon\right] \le \frac{1}{\varepsilon^2}\,\frac{\operatorname{Var}[X_1]}{n}$,

(ii) $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{L^2} \mu$.

(The fact that $\sigma^2 < \infty$ implies $\mu < \infty$.)

(c) $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{L^2} \mu$ implies $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{P} \mu$, which in turn implies $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{D} \mu$.

(d) If the $X_n$ are i.i.d. random variables such that $\lim_{t\to\infty} t\,P[|X_1| > t] = 0$, then $\frac{1}{n}S_n - E X_1^{(n)} \xrightarrow{P} 0$, where $X_1^{(n)}$ denotes $X_1$ truncated at $n$.

(e) Khintchine's theorem: If the $X_n$ are i.i.d. random variables and $E|X_1| < \infty$, then $E X_1^{(n)} \to E X_1$ and $\frac{1}{n}S_n \xrightarrow{P} E X_1$.

• No assumption is made about the finiteness of the variance.

12.20 (SLLN).

(a) Kolmogorov's SLLN: Consider a sequence $(X_n)$ of independent random variables. If $\sum_n \frac{\operatorname{Var} X_n}{n^2} < \infty$, then $\frac{1}{n}S_n - \frac{1}{n}E S_n \xrightarrow{a.s.} 0$.

• In particular, for independent $X_n$, if $\frac{1}{n}\sum_{i=1}^{n} E X_i \to a$ or $E X_n \to a$, then $\frac{1}{n}S_n \xrightarrow{a.s.} a$.

(b) Khintchine's SLLN: If the $X_i$'s are i.i.d. with finite mean $\mu$, then $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{a.s.} \mu$.

(c) Consider a sequence $(X_n)$ of i.i.d. random variables. Suppose $E X_1^- < \infty$ and $E X_1^+ = \infty$. Then $\frac{1}{n}S_n \xrightarrow{a.s.} \infty$.

• Suppose that the $X_n \ge 0$ are i.i.d. random variables and $E X_n = \infty$. Then $\frac{1}{n}S_n \xrightarrow{a.s.} \infty$.

12.21 (Relationship between the LLN and the convergence of relative frequency to probability). Consider i.i.d. $Z_i \sim Z$. Let $X_i = 1_A(Z_i)$. Then $\frac{1}{n}S_n = \frac{1}{n}\sum_{i=1}^{n} 1_A(Z_i) = r_n(A)$, the relative frequency of an event $A$. Via the LLN and appropriate conditions, $r_n(A)$ converges to $E\left[\frac{1}{n}\sum_{i=1}^{n} 1_A(Z_i)\right] = P[Z \in A]$.

12.4 Central Limit Theorem (CLT)

Suppose that $(X_k)_{k\ge1}$ is a sequence of i.i.d. random variables with mean $m$ and variance $0 < \sigma^2 < \infty$. Let $S_n = \sum_{k=1}^{n} X_k$.

12.22 (Lindeberg–Lévy theorem).

(a) $\frac{S_n - nm}{\sigma\sqrt{n}} = \frac{1}{\sqrt{n}}\sum_{k=1}^{n}\frac{X_k - m}{\sigma} \Rightarrow \mathcal{N}(0, 1)$.

(b) $\frac{S_n - nm}{\sqrt{n}} = \frac{1}{\sqrt{n}}\sum_{k=1}^{n}(X_k - m) \Rightarrow \mathcal{N}(0, \sigma^2)$.

To see this, let $Z_k = \frac{X_k - m}{\sigma} \stackrel{\text{i.i.d.}}{\sim} Z$ and $Y_n = \frac{1}{\sqrt{n}}\sum_{k=1}^{n} Z_k$. Then $EZ = 0$, $\operatorname{Var} Z = 1$, and $\varphi_{Y_n}(t) = \left(\varphi_Z\!\left(\frac{t}{\sqrt{n}}\right)\right)^n$. Approximating $e^x \approx 1 + x + \frac{1}{2}x^2$, we have $\varphi_X(t) \approx 1 + jtEX - \frac{1}{2}t^2E[X^2]$ (see also (24)), and hence
$$\varphi_{Y_n}(t) \approx \left(1 - \frac{1}{2}\frac{t^2}{n}\right)^n \to e^{-\frac{t^2}{2}}.$$

• The case of Bernoulli(1/2) was derived by Abraham de Moivre around 1733. The case of Bernoulli($p$) for $0 < p < 1$ was considered by Pierre-Simon Laplace [9, p. 208].

12.23 (Approximation of densities and pmfs using the CLT). Approximate the distribution of $S_n$ by $\mathcal{N}(nm, n\sigma^2)$.

• $F_{S_n}(s) \approx \Phi\!\left(\frac{s - nm}{\sigma\sqrt{n}}\right)$.

• $f_{S_n}(s) \approx \frac{1}{\sqrt{2\pi}\,\sigma\sqrt{n}}\, e^{-\frac{1}{2}\left(\frac{s - nm}{\sigma\sqrt{n}}\right)^2}$.

• If the $X_i$ are integer-valued, then
$$P[S_n = k] = P\left[k - \frac{1}{2} < S_n \le k + \frac{1}{2}\right] \approx \frac{1}{\sqrt{2\pi}\,\sigma\sqrt{n}}\, e^{-\frac{1}{2}\left(\frac{k - nm}{\sigma\sqrt{n}}\right)^2}$$
[9, eq (5.14), p. 213].

The approximation is best for $s, k$ near $nm$ [9, p. 211].

• The approximation $n! \approx \sqrt{2\pi}\, n^{n+\frac{1}{2}} e^{-n}$ can be derived by approximating the density of $S_n$ when $X_i \sim \mathcal{E}(1)$. We know that $f_{S_n}(s) = \frac{s^{n-1}e^{-s}}{(n-1)!}$. Approximating the density at $s = n$ gives $(n-1)! \approx \sqrt{2\pi}\, n^{n-\frac{1}{2}} e^{-n}$; multiplying through by $n$ gives the result [9, Ex. 5.18, p. 212].

• See also the normal approximation to the binomial in (14).
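The following sketch illustrates 12.23 for integer-valued summands: with $X_i \stackrel{\text{i.i.d.}}{\sim}$ Bernoulli$(0.3)$ (an arbitrary illustrative choice), it compares the exact binomial pmf of $S_n$ near $nm$ with the normal-density approximation above.

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 100, 0.3                        # illustrative choices (assumptions)
m, sigma = p, np.sqrt(p * (1 - p))     # mean and std of each Bernoulli term
for k in [25, 30, 35]:
    exact = binom.pmf(k, n, p)
    approx = norm.pdf(k, loc=n * m, scale=sigma * np.sqrt(n))
    print(f"k={k}: exact {exact:.5f}  CLT approximation {approx:.5f}")
```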

13 Conditional Probability and Expectation

13.1 Conditional Probability

Definition 13.1. Suppose that, conditioned on $X = x$, $Y$ has distribution $Q$. Then we write $Y\,|\,X = x \sim Q$ [23, p 40]. It might be clearer to write $P^{Y|X=x} = Q$.

13.2. Discrete random variables

(a) $p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}$.

(b) $p_{X,Y}(x, y) = p_{X|Y}(x|y)\, p_Y(y) = p_{Y|X}(y|x)\, p_X(x)$.

(c) The law of total probability: $P[Y \in C] = \sum_x P[Y \in C\,|\,X = x]\, P[X = x]$. In particular, $p_Y(y) = \sum_x p_{Y|X}(y|x)\, p_X(x)$.

(d) $p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x)\, p_X(x)}{\sum_{x'} p_{Y|X}(y|x')\, p_X(x')}$.

(e) If $Y = X + Z$, where $X$ and $Z$ are independent, then

• $p_{Y|X}(y|x) = p_Z(y - x)$,

• $p_Y(y) = \sum_x p_Z(y - x)\, p_X(x)$,

• $p_{X|Y}(x|y) = \frac{p_Z(y - x)\, p_X(x)}{\sum_{x'} p_Z(y - x')\, p_X(x')}$.

(f) The substitution law of conditional probability:
$$P[g(X, Y) = z\,|\,X = x] = P[g(x, Y) = z\,|\,X = x].$$

• When $X$ and $Y$ are independent, we can "drop the conditioning."

13.3. Absolutely continuous random variables

(a) $f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)}$.

(b) $F_{Y|X}(y|x) = \int_{-\infty}^{y} f_{Y|X}(t|x)\,dt$.

13.4. $P[(X, Y) \in A] = E\big[P[(X, Y) \in A\,|\,X]\big] = \int_A f_{X,Y}(x, y)\,d(x, y)$.

13.5. $\frac{\partial}{\partial z} F_{Y|X}\big(g(z, x)\,|\,x\big) = \frac{\partial}{\partial z}\int_{-\infty}^{g(z,x)} f_{Y|X}(y|x)\,dy = f_{Y|X}\big(g(z, x)\,|\,x\big)\,\frac{\partial}{\partial z} g(z, x)$.


13.6. $F_{X|Y}(x|y) = P[X \le x\,|\,Y = y] = \lim_{\Delta\to 0}\frac{P[X \le x,\ y - \Delta < Y \le y]}{P[y - \Delta < Y \le y]} = E\big[1_{(-\infty,x]}(X)\,\big|\,Y = y\big]$.

13.7. Define $f_{X|A}(x)$ to be $\frac{f_{X,A}(x)}{P(A)}$. Then
$$P[A\,|\,X = x] = \lim_{\Delta\to 0}\frac{\int_{x-\Delta}^{x} f_{X,A}(x')\,dx'}{\int_{x-\Delta}^{x} f_X(x')\,dx'} = \frac{f_{X,A}(x)}{f_X(x)} = \frac{P(A)\, f_{X|A}(x)}{f_X(x)}.$$

13.8. For independent $X_i$, $P[\forall i\ X_i \in B_i\,|\,\forall i\ X_i \in C_i] = \prod_i P[X_i \in B_i\,|\,X_i \in C_i]$.

13.2 Conditional Expectation

Definition 13.9. $E[g(Y)|X = x] = \sum_y g(y)\, p_{Y|X}(y|x)$.

• In particular, $E[Y|X = x] = \sum_y y\, p_{Y|X}(y|x)$.

• Note that $E[Y|X = x]$ is a function of $x$.

13.10. Properties of conditional expectation.

(a) Substitution law for conditional expectation: $E[g(X, Y)|X = x] = E[g(x, Y)|X = x]$.

(b) $E[h(X) g(Y)|X = x] = h(x)\, E[g(Y)|X = x]$.

(c) (The rule of iterated expectations) $EY = E\big[E[Y|X]\big]$.

(d) Law of total probability for expectation:
$$E[g(X, Y)] = E\big[E[g(X, Y)|X]\big].$$

(i) $E[g(Y)] = E\big[E[g(Y)|X]\big]$.

(ii) $F_Y(y) = E\big[F_{Y|X}(y|X)\big]$. (Take $g(z) = 1_{(-\infty,y]}(z)$.)

(e) $E[g(X) h(Y)|X = x] = g(x)\, E[h(Y)|X = x]$.

(f) $E[g(X) h(Y)] = E\big[g(X)\, E[h(Y)|X]\big]$.

(g) $E[X + Y|Z = z] = E[X|Z = z] + E[Y|Z = z]$.

(h) $E[AX|z] = A\, E[X|z]$.

(i) $E\big[(X - E[X|Y])\,\big|\,Y\big] = 0$ and $E\big[X - E[X|Y]\big] = 0$.

(j) $\min_{g} E\big[(Y - g(X))^2\big] = E\big[(Y - E[Y|X])^2\big]$, where $g$ ranges over all functions. Hence, $E[Y|X]$ is sometimes called the regression of $Y$ on $X$, the "best" predictor of $Y$ conditional on $X$.


Definition 13.11. The conditional variance is defined as
$$\operatorname{Var}[Y|X = x] = \int\big(y - m(x)\big)^2 f(y|x)\,dy,$$
where $m(x) = E[Y|X = x]$.

13.12. Properties of conditional variance

(a) $\operatorname{Var} Y = E\big[\operatorname{Var}[Y|X]\big] + \operatorname{Var}\big[E[Y|X]\big]$.

In other words, suppose that given $X = x$, the mean and variance of $Y$ are $m(x)$ and $v(x)$. Then the variance of $Y$ is $\operatorname{Var} Y = E[v(X)] + \operatorname{Var}[m(X)]$. Recall that for any function $g$, we have $E[g(Y)] = E\big[E[g(Y)|X]\big]$. Because $EY$ is just a constant, say $\mu$, we can define $g(y) = (y - \mu)^2$, which then implies $\operatorname{Var} Y = E\big[E[g(Y)|X]\big]$. Note, however, that $E[g(Y)|X]$ and $\operatorname{Var}[Y|X]$ are not the same. Suppose that, conditioned on $X$, $Y$ has distribution $Q$ with mean $m(x)$ and variance $v(x)$. Then $v(x) = \int(y - m(x))^2\,Q(dy)$, whereas $E[g(Y)|X = x] = \int(y - \mu)^2\,Q(dy)$; note the use of $\mu$ instead of $m(x)$. Therefore, in general, $\operatorname{Var} Y \ne E\big[\operatorname{Var}[Y|X]\big]$.

• All three terms in the expression are nonnegative. $\operatorname{Var} Y$ is an upper bound for each of the terms on the RHS. (A numerical check of the decomposition is sketched after this list.)

(b) Suppose $N$ is independent of $(X, Z)$. Then $\operatorname{Var}[X + N|Z] = \operatorname{Var}[X|Z] + \operatorname{Var} N$.

(c) $\operatorname{Var}[AX|z] = A\operatorname{Var}[X|z]A^H$.
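A minimal numerical check of the law of total variance in (a). The conditional model used here (given $X = x$, $Y$ has mean $m(x) = 2x$ and variance $v(x) = x^2$, with $X \sim U(0,1)$) is purely an illustrative assumption; both sides should be close to $2/3$.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500_000
X = rng.uniform(size=n)
Y = 2 * X + X * rng.standard_normal(n)   # given X: mean 2X, variance X^2

lhs = Y.var()
rhs = np.mean(X**2) + np.var(2 * X)      # E[Var[Y|X]] + Var[E[Y|X]]
print("Var Y               :", lhs)
print("E[v(X)] + Var[m(X)] :", rhs)
```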

13.13. Suppose $E[Y|X] = X$. Then $\operatorname{Cov}[X, Y] = \operatorname{Var} X$. See also (5.20). This is also true for $Y = X + N$ with $X$ independent of $N$ and $N$ zero-mean noise.

Definition 13.14. $\mu_n[Y|X] = E\big[(Y - E[Y|X])^n\,\big|\,X\big]$.

13.15. Properties

(a) $\mu_3[Y] = E\big[\mu_3[Y|X]\big] + \mu_3\big[E[Y|X]\big]$, and
$\mu_4[Y] = E\big[\mu_4[Y|X]\big] + 6\,E\big[\operatorname{Var}[Y|X]\big]\operatorname{Var}\big[E[Y|X]\big] + \mu_4\big[E[Y|X]\big]$.

13.3 Conditional Independence

13.16. The following are equivalent conditions for $X_1, X_2, \ldots, X_n$ to be mutually independent conditioned on $Y$ (a.s.):

(a) $p(x_1^n\,|\,y) = \prod_{i=1}^{n} p(x_i\,|\,y)$.

(b) $\forall i \in [n]\setminus\{1\}$, $p(x_i\,|\,x_1^{i-1}, y) = p(x_i\,|\,y)$.

(c) $\forall i \in [n]$, $X_i$ and the vector $(X_j)_{j \in [n]\setminus\{i\}}$ are independent conditioned on $Y$.

Example 13.17. Suppose $X$ and $Y$ are independent. Conditioned on another random variable $Z$, it is not true in general that $X$ and $Y$ are still independent. See example (4.46): recall that $Z = X \oplus Y$, which can be rewritten as $Y = X \oplus Z$. Hence, when $Z = 0$, we must have $Y = X$.


13.18. Suppose we know that $f_{X|Y,Z}(x|y, z) = g(x, y)$; that is, $f_{X|Y,Z}$ does not depend on $z$. Then, conditioned on $Y$, $X$ and $Z$ are independent. In this case,
$$f_{X|Y,Z}(x|y, z) = f_{X|Y}(x|y) = g(x, y).$$

13.19. Suppose we know that $f_{Z|V,U_1,U_2}(z|v, u_1, u_2) = f_{Z|V}(z|v)$ for all $z, v, u_1, u_2$. Then, conditioned on $V$, we can conclude that $Z$ and $(U_1, U_2)$ are independent. This further implies that, given $V$, $Z$ and each $U_i$ are independent. Moreover,
$$f_{Z|V,U_1,U_2}(z|v, u_1, u_2) = f_{Z|V,U_1}(z|v, u_1) = f_{Z|V,U_2}(z|v, u_2) = f_{Z|V}(z|v).$$

14 Real-valued Jointly Gaussian

Definition 14.1. An $\mathbb{R}^d$-valued random vector $X$ is jointly Gaussian or jointly normal if and only if for every $v \in \mathbb{R}^d$, the random variable $v^T X$ is Gaussian.

• In order for this definition to make sense when $v = 0$ or when $X$ has a singular covariance matrix, we agree that any constant random variable is considered to be Gaussian.

• Of course, the mean and variance of $v^T X$ are $v^T EX$ and $v^T\Lambda_X v$, respectively.

If X is a Gaussian random vector with mean vector m and covariance matrix Λ, we writeX ∼ N (m,Λ).

14.2. Properties of a jointly Gaussian random vector $X \sim \mathcal{N}(m, \Lambda)$

(a) $m = EX$, $\Lambda = \operatorname{Cov}[X] = E\big[(X - EX)(X - EX)^T\big]$.

(b) $f_X(x) = \frac{1}{(2\pi)^{n/2}\sqrt{\det(\Lambda)}}\, e^{-\frac{1}{2}(x - m)^T\Lambda^{-1}(x - m)}$.

• To remember the form of the above formula, note that the exponent has to be a scalar, so we need $(x - m)^T\Lambda^{-1}(x - m)$ rather than having the transpose on the last term. To make this clearer, set $\Lambda = I$; then we must have a dot product. Note also that $v^T A v = \sum_k\sum_\ell v_k A_{k\ell} v_\ell$.

• The above formula can be derived by starting from a random vector $Z$ whose components are i.i.d. $\mathcal{N}(0, 1)$: let $X = \Lambda^{1/2} Z + m$ and use (31) and (37).

• For independent $X_i \sim \mathcal{N}(m_i, \sigma_i^2)$,
$$f_X(x) = \frac{1}{(2\pi)^{n/2}\prod_i\sigma_i}\, e^{-\frac{1}{2}\sum_i\left(\frac{x_i - m_i}{\sigma_i}\right)^2}.$$
In particular, if $X_i \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$,
$$f_X(x) = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}x^T x}. \qquad (37)$$


• $(2\pi)^{n/2}\sqrt{\det(\Lambda)} = \sqrt{\det(2\pi\Lambda)}$.

• The Gaussian density is constant on the "ellipsoids" centered at $m$: $\{x \in \mathbb{R}^n : (x - m)^T\Lambda^{-1}(x - m) = \text{constant}\}$.

(c) $\varphi_X(v) = e^{jv^T m - \frac{1}{2}v^T\Lambda v} = e^{\,j\left(\sum_i v_i EX_i\right) - \frac{1}{2}\left(\sum_k\sum_\ell v_k v_\ell\operatorname{Cov}[X_k, X_\ell]\right)}$.

• This can be derived from Definition 14.1 by noting that
$$\varphi_X(v) = E\big[e^{jv^T X}\big] = E\big[e^{j\cdot 1\cdot Y}\big], \qquad Y = v^T X,$$
which is simply $\varphi_Y(1)$, where $Y = v^T X$ is, by definition, normal.

(d) A random vector $X$ in $\mathbb{R}^d$ is jointly Gaussian if and only if for every $v \in \mathbb{R}^d$, the random variable $v^T X$ is Gaussian.

• Independent Gaussian random variables are jointly Gaussian.

(e) Joint normality is preserved under linear transformation: suppose Y = AX + b, thenY ∼ N

(Am+ b, AΛAT

).

(f) If (X, Y ) jointly Gaussian, then X and Y are independent if and only if Cov [X, Y ] =0. Hence, uncorrelated jointly Gaussian random variables are independent.

(g) Note that the joint density does not exists when the covariance matrix Λ is singular.

(h) For i.i.d. $\mathcal{N}(\mu, \sigma^2)$ components:
$$f_{X_1^n}(x_1^n) = \frac{1}{(2\pi)^{n/2}\sigma^n}\exp\left\{-\frac{1}{2\sigma^2}\|x - \mu\|^2\right\} = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n}x_i - \frac{n\mu^2}{2\sigma^2}\right\}.$$

(i) Third-order joint Gaussian central moments are 0:
$$E\big[(X_i - EX_i)(X_j - EX_j)(X_k - EX_k)\big] = 0 \quad\text{for all } i, j, k \text{ (not necessarily distinct)}.$$
In particular, $E[(X - EX)^3] = 0$.

(j) Isserlis's theorem: Any fourth-order central moment of jointly Gaussian r.v.'s is expressible as a sum of products of pairs of their covariances:
$$E\big[(X_i - EX_i)(X_j - EX_j)(X_k - EX_k)(X_\ell - EX_\ell)\big] = \operatorname{Cov}[X_i, X_j]\operatorname{Cov}[X_k, X_\ell] + \operatorname{Cov}[X_i, X_k]\operatorname{Cov}[X_j, X_\ell] + \operatorname{Cov}[X_i, X_\ell]\operatorname{Cov}[X_j, X_k].$$
Note that $\frac{1}{2}\binom{4}{2} = 3$.


• In particular, $E[(X - EX)^4] = 3\sigma^4$.

(k) To generate $\mathcal{N}(m, \Lambda)$: first, by the spectral theorem, $\Lambda = V D V^T$, where $V$ is an orthogonal matrix whose columns are eigenvectors of $\Lambda$ and $D$ is a diagonal matrix containing the eigenvalues of $\Lambda$. The random vector we want is $V X + m$, where $X \sim \mathcal{N}(0, D)$.
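A minimal sketch of the spectral-decomposition recipe in (k) for sampling $\mathcal{N}(m, \Lambda)$; the specific mean vector, covariance matrix, and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
m = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.0]])                 # illustrative mean and covariance

w, V = np.linalg.eigh(Lam)                   # spectral theorem: Lam = V diag(w) V^T
Z = rng.standard_normal((200_000, 2)) * np.sqrt(w)   # rows ~ N(0, D), D = diag(w)
samples = Z @ V.T + m                        # V X + m ~ N(m, Lam)

print("sample mean      :", samples.mean(axis=0))
print("sample covariance:\n", np.cov(samples, rowvar=False))
```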

14.3. For the bivariate normal, $f_{X,Y}(x, y)$ is
$$\frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\exp\left\{-\frac{\left(\frac{x - EX}{\sigma_X}\right)^2 - 2\rho\left(\frac{x - EX}{\sigma_X}\right)\left(\frac{y - EY}{\sigma_Y}\right) + \left(\frac{y - EY}{\sigma_Y}\right)^2}{2(1 - \rho^2)}\right\},$$
where $\rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\sigma_Y} \in [-1, 1]$. Here, $x, y \in \mathbb{R}$.

• $f_{X,Y}(x, y) = \frac{1}{\sigma_X\sigma_Y}\,\psi_\rho\!\left(\frac{x - m_X}{\sigma_X}, \frac{y - m_Y}{\sigma_Y}\right)$.

• For the zero-mean, $\rho = 0$ case, $f_{X,Y}$ is constant on ellipses of the form $\left(\frac{x}{\sigma_X}\right)^2 + \left(\frac{y}{\sigma_Y}\right)^2 = r^2$.

• $\Lambda = \begin{pmatrix} \sigma_X^2 & \operatorname{Cov}[X, Y]\\ \operatorname{Cov}[X, Y] & \sigma_Y^2 \end{pmatrix} = \begin{pmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y\\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix}$.

• The following are equivalent:

(a) ρ = 0

(b) Cov [X, Y ] = 0

(c) X and Y are independent.

• $|\rho| = 1$ if and only if $(X - EX) = k(Y - EY)$ for some constant $k$. In this case,
$$\rho = \frac{k}{|k|} = \operatorname{sign}(k), \qquad |k| = \frac{\sigma_X}{\sigma_Y}.$$

• Suppose fX,Y (x, y) only depends on√x2 + y2 and X and Y are independent, then

X and Y are normal with zero mean and equal variance.

• $X|Y \sim \mathcal{N}\!\left(\rho\frac{\sigma_X}{\sigma_Y}(Y - m_Y) + m_X,\ \sigma_X^2(1 - \rho^2)\right)$ and $Y|X \sim \mathcal{N}\!\left(\rho\frac{\sigma_Y}{\sigma_X}(X - m_X) + m_Y,\ \sigma_Y^2(1 - \rho^2)\right)$.

• The standard bivariate density is defined as
$$\psi_\rho(u, v) = \frac{1}{2\pi\sqrt{1-\rho^2}}\, e^{-\frac{1}{2(1-\rho^2)}\left(u^2 - 2\rho uv + v^2\right)} = \underbrace{\psi(u)}_{f_U(u)}\;\underbrace{\frac{1}{\sqrt{1-\rho^2}}\,\psi\!\left(\frac{v - \rho u}{\sqrt{1-\rho^2}}\right)}_{f_{V|U}(v|u),\ \ V|U \sim \mathcal{N}(\rho U,\, 1-\rho^2)} = \underbrace{\frac{1}{\sqrt{1-\rho^2}}\,\psi\!\left(\frac{u - \rho v}{\sqrt{1-\rho^2}}\right)}_{f_{U|V}(u|v),\ \ U|V \sim \mathcal{N}(\rho V,\, 1-\rho^2)}\;\underbrace{\psi(v)}_{f_V(v)}$$


[9, eq. (7.22) p 309, eq. (7.23) p 311, eq. (7.26) p 313]. This is the joint density ofU, V where U, V ∼ N (0, 1) with Cov [U, V ] = E [UV ] = ρ.

The general bivariate Gaussian pair is obtained from the transformation
$$\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} \sigma_X U + m_X \\ \sigma_Y V + m_Y \end{pmatrix} = \begin{pmatrix} \sigma_X & 0 \\ 0 & \sigma_Y \end{pmatrix}\begin{pmatrix} U \\ V \end{pmatrix} + \begin{pmatrix} m_X \\ m_Y \end{pmatrix}.$$

$f_{U|V}(u|v)$ is $N(\rho v, 1-\rho^2)$. In other words, $U|V \sim N(\rho V, 1-\rho^2)$.
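As a small illustration of 14.3 (all numerical values below are assumptions of this sketch, not from the notes), one can draw the standard pair by sampling $U \sim N(0,1)$ and then $V\,|\,U \sim N(\rho U, 1-\rho^2)$, and scale/shift to obtain a general bivariate Gaussian pair (X, Y):

```python
# Sketch: construct a correlated bivariate Gaussian from two independent standard normals.
import numpy as np

rng = np.random.default_rng(1)
rho, mX, mY, sX, sY = 0.6, 1.0, -1.0, 2.0, 0.5   # illustrative parameters

n = 200_000
U = rng.standard_normal(n)
V = rho * U + np.sqrt(1 - rho**2) * rng.standard_normal(n)   # V | U ~ N(rho*U, 1 - rho^2)
X, Y = sX * U + mX, sY * V + mY                              # general bivariate pair

print(round(float(np.corrcoef(X, Y)[0, 1]), 3))   # ~ rho
print(round(float(X.mean()), 3), round(float(Y.var()), 3))   # ~ mX, sY^2
```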

14.4 (Conditional Gaussian).

(a) Suppose (X, Y) are jointly Gaussian; that is, $\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\!\left(\begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \Lambda_X & \Lambda_{XY} \\ \Lambda_{YX} & \Lambda_Y \end{pmatrix}\right)$. Then $f_{X|Y}(x|y)$ is $N\!\left(E[X|y],\, \Lambda_{X|y}\right)$ where $E[X|y] = \mu_X + \Lambda_{XY}\Lambda_Y^{-1}(y - \mu_Y)$ and $\Lambda_{X|y} = \Lambda_X - \Lambda_{XY}\Lambda_Y^{-1}\Lambda_{YX}$ (a numerical sketch appears after item (b)).

• Note the direction of the factors in the formula for $\Lambda_{X|y}$: start at $\Lambda_X$, go across to $\Lambda_{XY}$, down through $\Lambda_Y^{-1}$, and back via $\Lambda_{YX}$.

(b) Suppose (X, Y, W) are jointly Gaussian with $W \perp\!\!\!\perp (X, Y)$. Set $V = BX + W$. Then $V|y \sim N\!\left(E[V|y],\, \Lambda_{V|y}\right)$ where $E[V|y] = B\,E[X|y] + EW$ and $\Lambda_{V|y} = B\,\Lambda_{X|y}B^T + \Lambda_W$.
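A minimal numerical sketch of 14.4(a) follows; the particular block means and covariances are made-up values used only to illustrate the formulas for E[X|y] and Λ_{X|y}.

```python
# Sketch: conditional mean and covariance of a jointly Gaussian pair (X, Y) given Y = y.
import numpy as np

mu_X = np.array([0.0, 1.0])                # illustrative block means
mu_Y = np.array([2.0])
Lam_X = np.array([[2.0, 0.3],
                  [0.3, 1.0]])             # illustrative block covariances
Lam_XY = np.array([[0.5],
                   [0.2]])
Lam_Y = np.array([[1.5]])

y = np.array([3.0])
K = Lam_XY @ np.linalg.inv(Lam_Y)          # "gain" Lam_XY Lam_Y^{-1}
cond_mean = mu_X + K @ (y - mu_Y)          # E[X | y]
cond_cov = Lam_X - K @ Lam_XY.T            # Lam_X - Lam_XY Lam_Y^{-1} Lam_YX

print(cond_mean)
print(cond_cov)
```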

15 Bayesian Detection and Estimation

Consider a pair of random vectors Θ and Y, where Θ is not observed but Y is observed. We know the joint distribution of the pair (Θ, Y), which is usually given in the form of the prior distribution $p_\Theta(\theta)$ and the conditional distribution $p_{Y|\Theta}(y|\theta)$. By an estimator of Θ based on Y, we mean a function g such that $\hat{\Theta}(Y) = g(Y)$ is our estimate or “guess” of the value of Θ.

15.1 (Orthogonality Principle). Let D be a collection of random vectors with the same dimension as Θ. For a random vector Z, suppose that
$$\forall X \in D,\quad Z - X \in D. \tag{38}$$
If
$$E\big[X^T(\Theta - Z)\big] = 0, \quad \forall X \in D, \tag{39}$$
then
$$E\big[|\Theta - X|^2\big] = E\big[|\Theta - Z|^2\big] + E\big[|Z - X|^2\big] + 2\underbrace{E\big[(\underbrace{Z - X}_{\in D})^T(\Theta - Z)\big]}_{=0},$$
which implies $E\big[|\Theta - Z|^2\big] \le E\big[|\Theta - X|^2\big]$ for all $X \in D$.


• If D is a subspace and Z ∈ D, then (38) is automatically satisfied.

• (39) says that the vector Θ− Z is orthogonal to all vectors in D.

Example 15.2. Suppose Θ and N are independent Poisson random variables with respective parameters λ and µ. Let $Y = \Theta + N$ be the observation.

• Y is $P(\lambda + \mu)$.

• Conditioned on $Y = y$, Θ is $B\!\left(y, \frac{\lambda}{\lambda+\mu}\right)$.

• $\hat{\Theta}_{\mathrm{MMSE}}(Y) = E[\Theta|Y] = \frac{\lambda}{\lambda+\mu}\, Y$.

• $\operatorname{Var}[\Theta|Y] = Y\frac{\lambda\mu}{(\lambda+\mu)^2}$, and $\mathrm{MSE} = E\big[\operatorname{Var}[\Theta|Y]\big] = \frac{\lambda\mu}{\lambda+\mu} < \operatorname{Var} Y = \lambda + \mu$.

See also [7, Q 15.17].
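A quick Monte Carlo sanity check of Example 15.2 is sketched below (the parameter values λ = 3 and µ = 5 and the sample size are arbitrary choices); it compares the empirical mean-squared error of $E[\Theta|Y] = \frac{\lambda}{\lambda+\mu}Y$ with $\frac{\lambda\mu}{\lambda+\mu}$.

```python
# Sketch: Monte Carlo check of the Poisson MMSE estimator in Example 15.2.
import numpy as np

rng = np.random.default_rng(2)
lam, mu, n = 3.0, 5.0, 1_000_000          # illustrative parameters

theta = rng.poisson(lam, n)
noise = rng.poisson(mu, n)
y = theta + noise                         # observation Y = Theta + N

theta_hat = lam / (lam + mu) * y          # MMSE estimator E[Theta | Y]
print(round(float(np.mean((theta - theta_hat) ** 2)), 3))  # empirical MSE
print(round(lam * mu / (lam + mu), 3))                      # lam*mu/(lam+mu) = 1.875
```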

15.3 (Weighted Error). Suppose we define the error by $\mathcal{E} = (\Theta - \hat{\Theta}(Y))^T W (\Theta - \hat{\Theta}(Y))$ for some positive definite matrix W. (Note that the usual MSE uses W = I.) The expected weighted error $E[\mathcal{E}]$ is uniquely minimized by the MMSE estimator $\hat{\Theta}(Y) = E[\Theta|Y]$. The resulting MSE is $E\big[(\Theta - E[\Theta|Y])^T W (\Theta - E[\Theta|Y])\big]$.

In fact, for any function g(Y), the conditional weighted error $E\big[(\Theta - g(Y))^T W (\Theta - g(Y)) \,\big|\, Y\big]$ is given by
$$E\big[(\Theta - E[\Theta|Y])^T W (\Theta - E[\Theta|Y]) \,\big|\, Y\big] + (E[\Theta|Y] - g(Y))^T W (E[\Theta|Y] - g(Y)).$$
Hence, for each Y, it is minimized by choosing $g(Y) = E[\Theta|Y]$.

15.4 (Linear minimum mean-squared-error estimator). A linear MMSE estimator $\hat{\Theta}_{\mathrm{LMMSE}} = g_{\mathrm{LMMSE}}(Y)$ minimizes the MSE $E\big[|\Theta - g(Y)|^2\big]$ among all affine estimators of the form $g(y) = Ay + b$.

(a) It is sometimes called a Wiener filter.

(b) The scalar linear (affine) MMSE estimator is given by
$$\hat{\Theta}_{\mathrm{LMMSE}}(Y) = E\Theta + \frac{\operatorname{Cov}[Y,\Theta]}{\operatorname{Var} Y}(Y - EY).$$

• To see this in Hilbert space, note that we want the orthogonal projection of Θ onto the subspace spanned by two elements: Y and 1. An orthogonal basis of this subspace is $\{1,\, Y - EY\}$. Hence, the orthogonal projection is
$$\frac{\langle \Theta, 1\rangle}{\langle 1, 1\rangle} + \frac{\langle \Theta, Y - EY\rangle}{\langle Y - EY,\, Y - EY\rangle}(Y - EY).$$

• The above discussion suggests an alternative way of arriving at the LMMSE estimator: find a, b in $\hat{\Theta}(Y) = aY + b$ such that the error $E = \Theta - \hat{\Theta}(Y)$ is orthogonal to both 1 and $Y - EY$. The condition $\langle E, 1\rangle = 0$ requires $\hat{\Theta}(Y)$ to be unbiased. The condition $\langle E, Y - EY\rangle = 0$ gives $a = \frac{\operatorname{Cov}[Y,\Theta]}{\operatorname{Var} Y}$.


(c) The vector linear (affine) MMSE estimator is given by
$$\hat{\Theta}_{\mathrm{LMMSE}}(Y) = E\Theta + \Sigma_{\Theta Y}\Sigma_Y^{-1}(Y - EY)$$
and
$$\mathrm{MMSE} = \operatorname{Cov}\big[\Theta - \hat{\Theta}_{\mathrm{LMMSE}}(Y)\big] = \Sigma_\Theta - \Sigma_{\Theta Y}\Sigma_Y^{-1}\Sigma_{Y\Theta}.$$
In fact, the optimal choice of A is any solution of
$$A\Sigma_Y = \Sigma_{\Theta Y},$$
in which case
$$\operatorname{Cov}\big[\Theta - \hat{\Theta}_{\mathrm{LMMSE}}(Y)\big] = \Sigma_\Theta - A\Sigma_{Y\Theta} - \Sigma_{\Theta Y}A^T + A\Sigma_Y A^T = \Sigma_\Theta - \Sigma_{\Theta Y}A^T = \Sigma_\Theta - A\Sigma_{Y\Theta} = \Sigma_\Theta - A\Sigma_Y A^T.$$
When $\Sigma_Y$ is invertible, $A = \Sigma_{\Theta Y}\Sigma_Y^{-1}$. When $\Sigma_Y$ is singular, see [9, Q8.38, p. 359].

• The MSE can be rewritten as
$$E\big[|(\Theta - E\Theta) - A(Y - EY)|^2\big] + |E\Theta - A\,EY - b|^2,$$
which shows that the optimal choice of b is $b = E\Theta - A\,EY$. This is the b which makes the estimator unbiased.

• Fix A. Let $\tilde{\Theta} = \Theta - E\Theta$ and $\tilde{Y} = Y - EY$. Then
$$E\big[(B\tilde{Y})^T(\tilde{\Theta} - A\tilde{Y})\big] = 0 \quad \text{for every matrix } B$$
if and only if, for every matrix B,
$$E\big[|\tilde{\Theta} - A\tilde{Y}|^2\big] \le E\big[|\tilde{\Theta} - B\tilde{Y}|^2\big],$$
in which case
$$E\big[|\tilde{\Theta} - B\tilde{Y}|^2\big] = E\big[|\tilde{\Theta} - A\tilde{Y}|^2\big] + E\big[|(A - B)\tilde{Y}|^2\big].$$

• Additive noise in 1-D: $Y = \Theta + N$ where $\operatorname{Cov}[\Theta, N] = 0$. Then
$$\hat{\Theta}_{\mathrm{LMMSE}}(Y) = E\Theta + \frac{\operatorname{Cov}[\Theta, Y]}{\operatorname{Var} Y}(Y - EY) = E\Theta + \frac{\operatorname{Var}\Theta}{\operatorname{Var}\Theta + \operatorname{Var} N}\big(Y - (E\Theta + EN)\big) = E\Theta + \frac{\mathrm{SNR}}{1 + \mathrm{SNR}}\big(Y - (E\Theta + EN)\big),$$
where $\mathrm{SNR} = \operatorname{Var}\Theta / \operatorname{Var} N$, and
$$\mathrm{MMSE} = \operatorname{Var}\Theta - \frac{\operatorname{Cov}^2[\Theta, Y]}{\operatorname{Var} Y} = \frac{\operatorname{Var}\Theta\,\operatorname{Var} N}{\operatorname{Var}\Theta + \operatorname{Var} N}.$$
A short numerical sketch follows.
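The sketch below (with an arbitrarily chosen non-Gaussian Θ and noise level, not taken from the notes) computes the 1-D additive-noise LMMSE estimator from empirical moments and compares its empirical MSE with Var Θ · Var N / (Var Θ + Var N).

```python
# Sketch: 1-D additive-noise LMMSE estimator; note that Theta need not be Gaussian.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
theta = rng.exponential(2.0, n)        # E[Theta] = 2, Var Theta = 4 (illustrative prior)
noise = rng.normal(0.0, 1.0, n)        # E[N] = 0, Var N = 1, uncorrelated with Theta
y = theta + noise

a = theta.var() / (theta.var() + noise.var())      # Cov[Theta, Y] / Var Y
theta_hat = theta.mean() + a * (y - y.mean())      # affine LMMSE estimate

print(round(float(np.mean((theta - theta_hat) ** 2)), 3))                       # empirical MSE
print(round(float(theta.var() * noise.var() / (theta.var() + noise.var())), 3)) # ~ 0.8
```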


A Math Review

A.1 Inequalities

A.1. By definition,

• $\sum_{n=1}^\infty a_n = \sum_{n \in \mathbb{N}} a_n = \lim_{N \to \infty} \sum_{n=1}^N a_n$, and

• $\prod_{n=1}^\infty a_n = \prod_{n \in \mathbb{N}} a_n = \lim_{N \to \infty} \prod_{n=1}^N a_n$.

A.2. Inequalities involving exponential and logarithm.

(a) For any x,
$$e^x \ge 1 + x,$$
with equality if and only if x = 0.

(b) If we consider $x > -1$, then we have $\ln(x+1) \le x$. If we replace $x+1$ by x, then we have $\ln(x) \le x - 1$ for $x > 0$. If we replace x by $\frac{1}{x}$, we have $\ln(x) \ge 1 - \frac{1}{x}$. This gives the fundamental inequality of information theory:
$$1 - \frac{1}{x} \le \ln(x) \le x - 1 \quad \text{for } x > 0,$$
with equality if and only if x = 1. Alternative forms are listed below.

(i) For $x > -1$, $\frac{x}{1+x} \le \ln(1+x) \le x$ with equality if and only if x = 0.

(ii) For $x < 1$, $x \le -\ln(1-x) \le \frac{x}{1-x}$ with equality if and only if x = 0.

A.3. For $|x| \le 0.5$, we have
$$e^{x - x^2} \le 1 + x \le e^x. \tag{40}$$
This is because
$$x - x^2 \le \ln(1+x) \le x, \tag{41}$$
which is semi-proved by the plot in Figure 24.

A.4. Consider a triangular array of real numbers $(x_{n,k})$. Suppose (i) $\sum_{k=1}^{r_n} x_{n,k} \to x$ and (ii) $\sum_{k=1}^{r_n} x_{n,k}^2 \to 0$. Then,
$$\prod_{k=1}^{r_n}(1 + x_{n,k}) \to e^x.$$
Moreover, suppose the sum $\sum_{k=1}^{r_n} |x_{n,k}|$ converges as $n \to \infty$ (which automatically implies that condition (i) is true for some x). Then, condition (ii) is equivalent to condition (iii), where condition (iii) is the requirement that $\max_{k \in [r_n]} |x_{n,k}| \to 0$ as $n \to \infty$.


[Figure 24: Bounds for ln(1 + x) when x is small — plots of ln(1 + x), x − x², x, and x/(x + 1) for x ∈ (−0.8, 0.8).]

Proof. When n is large enough, conditions (ii) and (iii) each imply that $|x_{n,k}| \le 0.5$. (For (ii), note that we can find n large enough such that $|x_{n,k}|^2 \le \sum_k x_{n,k}^2 \le 0.5^2$.) Hence, we can apply (A.3) and get
$$e^{\sum_{k=1}^{r_n} x_{n,k} - \sum_{k=1}^{r_n} x_{n,k}^2} \le \prod_{k=1}^{r_n}(1 + x_{n,k}) \le e^{\sum_{k=1}^{r_n} x_{n,k}}. \tag{42}$$
Suppose $\sum_{k=1}^{r_n} |x_{n,k}| \to x_0$. To show that (iii) implies (ii), let $a_n = \max_{k \in [r_n]} |x_{n,k}|$. Then,
$$0 \le \sum_{k=1}^{r_n} x_{n,k}^2 \le a_n \sum_{k=1}^{r_n} |x_{n,k}| \to 0 \cdot x_0 = 0.$$
On the other hand, suppose we have (ii). Given any $\varepsilon > 0$, by (ii), $\exists n_0$ such that $\forall n \ge n_0$, $\sum_{k=1}^{r_n} x_{n,k}^2 \le \varepsilon^2$. Hence, for any k, $x_{n,k}^2 \le \sum_{k=1}^{r_n} x_{n,k}^2 \le \varepsilon^2$ and hence $|x_{n,k}| \le \varepsilon$, which implies $a_n \le \varepsilon$.

Note that when the $x_{n,k}$ are non-negative, condition (i) already implies that the sum $\sum_{k=1}^{r_n} |x_{n,k}|$ converges as $n \to \infty$. Alternative versions of A.4 are as follows.

(a) Suppose (ii) $\sum_{k=1}^{r_n} x_{n,k}^2 \to 0$ as $n \to \infty$. Then, as $n \to \infty$ we have
$$\prod_{k=1}^{r_n}(1 + x_{n,k}) \to e^x \quad \text{if and only if} \quad \sum_{k=1}^{r_n} x_{n,k} \to x. \tag{43}$$

Proof. We already know from A.4 that the RHS of (43) implies the LHS. Also, condition (ii) allows the use of A.3, which implies
$$\prod_{k=1}^{r_n}(1 + x_{n,k}) \le e^{\sum_{k=1}^{r_n} x_{n,k}} \le e^{\sum_{k=1}^{r_n} x_{n,k}^2}\prod_{k=1}^{r_n}(1 + x_{n,k}). \tag{42b}$$


(b) Suppose the $x_{n,k}$ are nonnegative and (iii) $a_n \to 0$ as $n \to \infty$. Then, as $n \to \infty$ we have
$$\prod_{k=1}^{r_n}(1 - x_{n,k}) \to e^{-x} \quad \text{if and only if} \quad \sum_{k=1}^{r_n} x_{n,k} \to x. \tag{44}$$

Proof. We already know from A.4 that the RHS of (44) implies the LHS. Also, condition (iii) allows the use of A.3, which implies
$$\prod_{k=1}^{r_n}(1 - x_{n,k}) \le e^{-\sum_{k=1}^{r_n} x_{n,k}} \le e^{\sum_{k=1}^{r_n} x_{n,k}^2}\prod_{k=1}^{r_n}(1 - x_{n,k}). \tag{42c}$$
Furthermore, by (41), we have
$$\sum_{k=1}^{r_n} x_{n,k}^2 \le a_n\Big(-\sum_{k=1}^{r_n} \ln(1 - x_{n,k})\Big) \to 0 \cdot x = 0.$$

A.5. Let $\alpha_i$ and $\beta_i$ be complex numbers with $|\alpha_i| \le 1$ and $|\beta_i| \le 1$. Then,
$$\left|\prod_{i=1}^m \alpha_i - \prod_{i=1}^m \beta_i\right| \le \sum_{i=1}^m |\alpha_i - \beta_i|.$$
In particular, $|\alpha^m - \beta^m| \le m|\alpha - \beta|$.

A.6. Suppose $\lim_{n\to\infty} a_n = a$. Then $\lim_{n\to\infty}\left(1 - \frac{a_n}{n}\right)^n = e^{-a}$ [9, p 584].

Proof. Use (A.4) with $r_n = n$ and $x_{n,k} = -\frac{a_n}{n}$. Then, $\sum_{k=1}^n x_{n,k} = -a_n \to -a$ and $\sum_{k=1}^n x_{n,k}^2 = a_n^2\frac{1}{n} \to a \cdot 0 = 0$.

Alternatively, from L'Hôpital's rule, $\lim_{n\to\infty}\left(1 - \frac{a}{n}\right)^n = e^{-a}$. (See also [19, Theorem 3.31, p 64].) This gives a direct proof for the case when $a > 0$: for n large enough, both $\left|1 - \frac{a_n}{n}\right|$ and $\left|1 - \frac{a}{n}\right|$ are $\le 1$ (this is where we need $a > 0$), so applying (A.5) gives
$$\left|\left(1 - \frac{a_n}{n}\right)^n - \left(1 - \frac{a}{n}\right)^n\right| \le |a_n - a| \to 0.$$
For $a < 0$, we use the facts that, for $b_n \to b > 0$, (1) $\left(\left(1 + \frac{b_n}{n}\right)^{-1}\right)^n = \left(\left(1 + \frac{b_n}{n}\right)^n\right)^{-1} \to e^{-b}$ and (2) for n large enough, both $\left|\left(1 + \frac{b}{n}\right)^{-1}\right|$ and $\left|\left(1 + \frac{b_n}{n}\right)^{-1}\right|$ are $\le 1$, and hence
$$\left|\left(\left(1 + \frac{b_n}{n}\right)^{-1}\right)^n - \left(\left(1 + \frac{b}{n}\right)^{-1}\right)^n\right| \le \frac{|b_n - b|}{\left(1 + \frac{b_n}{n}\right)\left(1 + \frac{b}{n}\right)} \to 0.$$
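A quick numerical illustration of A.6 (the value of a and the sequence aₙ below are arbitrary choices):

```python
# Sketch: (1 - a_n/n)^n approaches e^{-a} when a_n -> a.
import math

a = 2.0
for n in (10, 100, 1000, 10_000):
    a_n = a + 1.0 / n                      # any sequence with a_n -> a
    print(n, (1 - a_n / n) ** n)
print("limit e^{-a} =", math.exp(-a))      # ~ 0.1353
```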


A.2 Summations

A.7. Basic formulas:

(a) $\sum_{k=0}^n k = \frac{n(n+1)}{2}$

(b) $\sum_{k=0}^n k^2 = \frac{n(n+1)(2n+1)}{6} = \frac{1}{6}(2n^3 + 3n^2 + n)$

(c) $\sum_{k=0}^n k^3 = \left(\sum_{k=0}^n k\right)^2 = \frac{1}{4}n^2(n+1)^2 = \frac{1}{4}(n^4 + 2n^3 + n^2)$

A nicer formula is given by
$$\sum_{k=1}^n k(k+1)\cdots(k+d) = \frac{1}{d+2}\, n(n+1)\cdots(n+d+1). \tag{45}$$

A.8. Let $g(n) = \sum_{k=0}^n h(k)$ where h is a polynomial of degree d. Then, g is a polynomial of degree $d+1$; that is, $g(n) = \sum_{m=1}^{d+1} a_m n^m$.

• To find the coefficients $a_m$, evaluate $g(n)$ for $n = 1, 2, \ldots, d+1$. Note that the case $n = 0$ gives $a_0 = 0$, and hence the sum starts at $m = 1$.

• Alternatively, first express $h(k)$ in terms of sums of products of the form $k(k+1)\cdots(k+i)$:
$$h(k) = \left(\sum_{i=0}^{d-1} b_i\, k(k+1)\cdots(k+i)\right) + c. \tag{46}$$
To find the coefficients, substitute $k = 0, -1, -2, \ldots, -(d-1)$. For example, $k^3 = k(k+1)(k+2) - 3k(k+1) + k$. Then, to get $g(n)$, use (45); a quick check appears below.
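The following check (degree d = 3 is chosen as an example) applies (45) term by term to the expansion k³ = k(k+1)(k+2) − 3k(k+1) + k and compares with the closed form from A.7(c):

```python
# Sketch: summing k^3 via (46) + (45) and comparing with (n(n+1)/2)^2.
def sum_k3_via_45(n):
    # sum_{k=1}^{n} k(k+1)...(k+d) = n(n+1)...(n+d+1)/(d+2)
    s_d2 = n * (n + 1) * (n + 2) * (n + 3) // 4   # d = 2 term
    s_d1 = n * (n + 1) * (n + 2) // 3             # d = 1 term
    s_d0 = n * (n + 1) // 2                       # d = 0 term
    return s_d2 - 3 * s_d1 + s_d0                 # coefficients from k^3 = k(k+1)(k+2) - 3k(k+1) + k

for n in (5, 10, 100):
    assert sum_k3_via_45(n) == sum(k**3 for k in range(n + 1)) == (n * (n + 1) // 2) ** 2
print("A.8 / (45) check passed")
```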

A.9. Geometric Sums:

(a) $\sum_{i=0}^\infty \rho^i = \frac{1}{1-\rho}$ for $|\rho| < 1$

(b) $\sum_{i=k}^\infty \rho^i = \frac{\rho^k}{1-\rho}$

(c) $\sum_{i=a}^b \rho^i = \frac{\rho^a - \rho^{b+1}}{1-\rho}$

(d) $\sum_{i=0}^\infty i\rho^i = \frac{\rho}{(1-\rho)^2}$


(e) $\sum_{i=a}^b i\rho^i = \frac{\rho^{b+1}(b\rho - b - 1) - \rho^a(a\rho - a - \rho)}{(1-\rho)^2}$

(f) $\sum_{i=k}^\infty i\rho^i = \frac{k\rho^k}{1-\rho} + \frac{\rho^{k+1}}{(1-\rho)^2}$

(g) $\sum_{i=0}^\infty i^2\rho^i = \frac{\rho + \rho^2}{(1-\rho)^3}$

A.10. Double Sums:

(a) $\left(\sum_{i=1}^n a_i\right)^2 = \sum_{i=1}^n \sum_{j=1}^n a_i a_j$

(b) $\sum_{j=1}^\infty \sum_{i=j}^\infty f(i,j) = \sum_{i=1}^\infty \sum_{j=1}^i f(i,j) = \sum_{(i,j)} 1[i \ge j]\, f(i,j)$

A.11. Exponential Sums:

• $e^\lambda = \sum_{k=0}^\infty \frac{\lambda^k}{k!} = 1 + \lambda + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \cdots$

• $\lambda e^\lambda + e^\lambda = 1 + 2\lambda + \frac{3\lambda^2}{2!} + \frac{4\lambda^3}{3!} + \cdots = \sum_{k=1}^\infty k\frac{\lambda^{k-1}}{(k-1)!}$

A.12. Suppose h is a polynomial of degree d. Then
$$\sum_{k=0}^\infty h(k)\frac{\lambda^k}{k!} = g(\lambda)e^\lambda,$$
where g is another polynomial of the same degree. For example,
$$\sum_{k=0}^\infty k^3\frac{\lambda^k}{k!} = \left(\lambda^3 + 3\lambda^2 + \lambda\right)e^\lambda. \tag{47}$$
This result can be obtained by several techniques.

(a) Start with $e^\lambda = \sum_{k=0}^\infty \frac{\lambda^k}{k!}$. Then, we have
$$\sum_{k=0}^\infty k^3\frac{\lambda^k}{k!} = \lambda\frac{d}{d\lambda}\left(\lambda\frac{d}{d\lambda}\left(\lambda\frac{d}{d\lambda}e^\lambda\right)\right).$$

(b) We can expand
$$k^3 = k(k-1)(k-2) + 3k(k-1) + k, \tag{48}$$
similar to (46). Now note that
$$\sum_{k=0}^\infty k(k-1)\cdots(k-(\ell-1))\frac{\lambda^k}{k!} = \lambda^\ell e^\lambda.$$
Therefore, the coefficients of the terms in (48) directly become the coefficients in (47); a numerical check is given below.
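A short numerical check of (47) (the value of λ is an arbitrary choice; the series is truncated where the terms are negligible):

```python
# Sketch: verify sum_k k^3 lam^k / k! = (lam^3 + 3 lam^2 + lam) e^lam numerically.
import math

lam = 1.7
lhs = sum(k**3 * lam**k / math.factorial(k) for k in range(100))   # truncated series
rhs = (lam**3 + 3 * lam**2 + lam) * math.exp(lam)
print(lhs, rhs)   # agree to floating-point precision
```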


A.13. The zeta function $\zeta(s)$ is defined for any complex number s with $\operatorname{Re} s > 1$ by the Dirichlet series $\zeta(s) = \sum_{n=1}^\infty \frac{1}{n^s}$.

• For real-valued nonnegative x,

(a) $\zeta(x)$ converges for $x > 1$;

(b) $\zeta(x)$ diverges for $0 < x \le 1$

[9, Q2.48 p 105].

• $\zeta(1) = \infty$ corresponds to the harmonic series.

A.14. Abel's theorem: Let $a = (a_i : i \in \mathbb{N})$ be any sequence of real or complex numbers and let
$$G_a(z) = \sum_{i=0}^\infty a_i z^i$$
be the power series with coefficients a. Suppose that the series $\sum_{i=0}^\infty a_i$ converges. Then,
$$\lim_{z \to 1^-} G_a(z) = \sum_{i=0}^\infty a_i. \tag{49}$$
In the special case where all the coefficients $a_i$ are nonnegative real numbers, formula (49) holds also when the series $\sum_{i=0}^\infty a_i$ does not converge; i.e., in that case both sides of the formula equal $+\infty$.

A.3 Calculus

A.3.1 Derivatives

A.15. Basic Formulas

(a) $\frac{d}{dx}a^u = a^u \ln a\,\frac{du}{dx}$

(b) $\frac{d}{dx}\log_a u = \frac{\log_a e}{u}\,\frac{du}{dx}$, $a \ne 0, 1$

(c) Derivatives of products: suppose $f(x) = g(x)h(x)$; then
$$f^{(n)}(x) = \sum_{k=0}^n \binom{n}{k} g^{(n-k)}(x)\, h^{(k)}(x).$$
In fact,
$$\frac{d^n}{dt^n}\prod_{i=1}^r f_i(t) = \sum_{n_1 + \cdots + n_r = n} \frac{n!}{n_1!\, n_2!\cdots n_r!} \prod_{i=1}^r \frac{d^{n_i}}{dt^{n_i}} f_i(t).$$


Definition A.16 (Jacobian). In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. Let g be a function from a subset D of $\mathbb{R}^n$ to $\mathbb{R}^m$. If g is differentiable at $z \in D$, then all partial derivatives exist at z, and the Jacobian matrix of g at a point $z \in D$ is
$$dg(z) = \begin{pmatrix} \frac{\partial g_1}{\partial x_1}(z) & \cdots & \frac{\partial g_1}{\partial x_n}(z) \\ \vdots & \ddots & \vdots \\ \frac{\partial g_m}{\partial x_1}(z) & \cdots & \frac{\partial g_m}{\partial x_n}(z) \end{pmatrix} = \left(\frac{\partial g}{\partial x_1}(z), \ldots, \frac{\partial g}{\partial x_n}(z)\right).$$
Alternative notations for the Jacobian matrix are $J$, $\frac{\partial(g_1,\ldots,g_m)}{\partial(x_1,\ldots,x_n)}$ [7, p 242], and $J_g(x)$, where it is assumed that the Jacobian matrix is evaluated at $z = x = (x_1, \ldots, x_n)$.

• Linear approximation around z:
$$g(x) \approx \underbrace{dg(z)(x - z) + g(z)}_{\ell(x)}. \tag{50}$$
The function ℓ is the linearization [21] of g at the point z.

• Let $g: D \to \mathbb{R}^m$ with open $D \subset \mathbb{R}^n$ and $z \in D$. If, for all k and j, the partial derivative $\frac{\partial g_k}{\partial x_j}$ exists in a neighborhood of z and is continuous at z, then g is differentiable at z, and dg is continuous at z.

• Let A be an n-dimensional “box” defined by the corners x and $x + \Delta x$. The “volume” of the image g(A) is approximately $\left(\prod_i \Delta x_i\right)\left|\det dg(x)\right|$. Hence, the magnitude of the Jacobian determinant gives the ratio (scaling factor) of n-dimensional volumes (contents). In other words,
$$dy_1 \cdots dy_n = \left|\frac{\partial(y_1, \ldots, y_n)}{\partial(x_1, \ldots, x_n)}\right| dx_1 \cdots dx_n.$$

• d(g−1(y)) is the Jacobian of the inverse transformation.

• In MATLAB, use jacobian.

• Change of variables: Let g be a continuously differentiable map of the open set U onto V. Suppose that g is one-to-one and that $\det(dg(x)) \ne 0$ for all x. Then
$$\int_U h(g(x))\left|\det(dg(x))\right| dx = \int_V h(y)\, dy.$$

A.17. When m = 1, we have a scalar function, and we can talk about the gradient vector. The gradient (or gradient vector field) of a scalar function f(x) with respect to a vector variable $x = (x_1, \ldots, x_n)$ is
$$\nabla_x f(z) = \begin{pmatrix} \frac{\partial f}{\partial x_1}(z) \\ \vdots \\ \frac{\partial f}{\partial x_n}(z) \end{pmatrix} = (df(z))^T.$$


(a) The RHS of the linear approximation (50) characterizes the tangent hyperplane
$$L(x) = dg(z)(x - z) + g(z)$$
at the point z. The level surface that passes through the point z is given by $\{x : g(x) = g(z)\}$. Using the linear approximation (50), we see that around the point z, a point on the level surface must satisfy
$$dg(z)(x - z) \approx 0.$$
Hence, the tangent plane at z for the level surface is given by $\{x : dg(z)(x - z) = 0\}$. Note also that the gradient vector $\nabla g(z) = (dg(z))^T$ is perpendicular to the tangent plane through z: for any two points $x^{(1)}, x^{(2)}$ on the tangent plane through z,
$$dg(z)\big(x^{(2)} - x^{(1)}\big) = 0.$$
Hence, the gradient vector at z is perpendicular to the level surface at z.

(b) When n = 2, given a point $P = (x_0, y_0)$, we have the tangent plane
$$L(x, y) = \left(\frac{\partial g}{\partial x}(P)(x - x_0) + \frac{\partial g}{\partial y}(P)(y - y_0)\right) + g(P)$$
through P. The level curve that passes through the point P is given by $\{(x, y) : g(x, y) = g(P)\}$. Approximating g by L, we find that the tangent line at P for the level curve is given by
$$\left\{(x, y) : \frac{\partial g}{\partial x}(P)(x - x_0) + \frac{\partial g}{\partial y}(P)(y - y_0) = 0\right\}.$$
This line is perpendicular to the gradient vector.

(c) When n = 3, given a point $P = (x_0, y_0, z_0)$, we consider the level surface $S = \{(x, y, z) : g(x, y, z) = c\}$ where $c = g(P)$. The tangent plane at the point P on the level surface S is given by
$$\left\{(x, y, z) : \frac{\partial g}{\partial x}(P)(x - x_0) + \frac{\partial g}{\partial y}(P)(y - y_0) + \frac{\partial g}{\partial z}(P)(z - z_0) = 0\right\}.$$
This plane is perpendicular to the gradient vector.

A.18. For the gradient, if the argument is a row vector, then
$$\nabla_\theta\big(f^T(\theta)\big) = \begin{pmatrix} \frac{\partial f_1}{\partial \theta_1} & \frac{\partial f_2}{\partial \theta_1} & \cdots & \frac{\partial f_m}{\partial \theta_1} \\ \vdots & \vdots & & \vdots \\ \frac{\partial f_1}{\partial \theta_n} & \frac{\partial f_2}{\partial \theta_n} & \cdots & \frac{\partial f_m}{\partial \theta_n} \end{pmatrix}.$$


Definition A.19. Given a scalar-valued function f, the Hessian matrix is the square matrix of second partial derivatives
$$\nabla_\theta^2 f(\theta) = \nabla_\theta\big(\nabla_\theta f(\theta)\big)^T = \begin{pmatrix} \frac{\partial^2 f}{\partial\theta_1^2} & \cdots & \frac{\partial^2 f}{\partial\theta_1\partial\theta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial\theta_n\partial\theta_1} & \cdots & \frac{\partial^2 f}{\partial\theta_n^2} \end{pmatrix}.$$
It is symmetric for nice functions (e.g., when the second partial derivatives are continuous).

A.20. $\nabla_x\big(f^T(x)\big) = (df(x))^T$.

A.21. Let $f, g: \Omega \to \mathbb{R}^m$, $\Omega \subset \mathbb{R}^n$, and $h(x) = \langle f(x), g(x)\rangle : \Omega \to \mathbb{R}$. Then
$$dh(x) = (f(x))^T dg(x) + (g(x))^T df(x).$$

• For an $n \times n$ matrix A, let $f(x) = \langle Ax, x\rangle$. Then $df(x) = (Ax)^T I + x^T A = x^T A^T + x^T A$. If A is symmetric, then $df(x) = 2x^T A$. So, $\frac{\partial}{\partial x_j}\langle Ax, x\rangle = 2(Ax)_j$.

A.22. Chain rule: If f is differentiable at y and g is differentiable at $z = f(y)$, then $g \circ f$ is differentiable at y and
$$\underbrace{d(g \circ f)(y)}_{p \times n} = \underbrace{dg(z)}_{p \times m}\;\underbrace{df(y)}_{m \times n} \quad \text{(matrix multiplication)}.$$

• In particular,
$$\frac{d}{dt}g(x(t), y(t), z(t)) = \left.\left(\frac{\partial g}{\partial x}\frac{dx}{dt} + \frac{\partial g}{\partial y}\frac{dy}{dt} + \frac{\partial g}{\partial z}\frac{dz}{dt}\right)\right|_{(x,y,z) = (x(t),\, y(t),\, z(t))}.$$

A.23. Let $f: D \to \mathbb{R}^m$ where $D \subset \mathbb{R}^n$ is open and connected (so arcwise-connected). Then, $df(x) = 0$ for all $x \in D$ implies that f is constant.

A.24. If f is differentiable at y, then all partial and directional derivatives exist at y, and
$$df(y) = \begin{pmatrix} (\nabla f_1(y))^T \\ \vdots \\ (\nabla f_m(y))^T \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(y) & \cdots & \frac{\partial f_1}{\partial x_n}(y) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(y) & \cdots & \frac{\partial f_m}{\partial x_n}(y) \end{pmatrix} = \left(\frac{\partial f}{\partial x_1}(y), \ldots, \frac{\partial f}{\partial x_n}(y)\right).$$

A.25. Inversion (Mapping) Theorem: Let $\Omega \subset \mathbb{R}^n$ be open, $f: \Omega \to \mathbb{R}^n$ be $C^1$, and $c \in \Omega$. If $df(c)$ (an $n \times n$ matrix) is bijective, then there exists an open neighborhood U of c such that


(a) V = f(U) is an open neighborhood of f(c).

(b) f |U : U → V is bijective.

(c) g = (f |U )−1 : V → U is C1.

(d) ∀y ∈ V dg (y) = [df (g (y))]−1.

The following differential/gradient pairs are useful (Q symmetric where indicated):

• $d(x) = I$ and $\nabla_x(x^T) = I$.

• $d(\|x\|^2) = 2x^T$ and $\nabla_x\|x\|^2 = 2x$.

• $d(Ax + b) = A$, $d(a^T x) = a^T$, $\nabla_x\big((Ax + b)^T\big) = A^T$, and $\nabla_x(a^T x) = a$.

• $d\big(f^T(x)g(x)\big) = f^T(x)\,dg(x) + g^T(x)\,df(x)$ and $\nabla_x\big(f^T(x)g(x)\big) = (dg(x))^T f(x) + (df(x))^T g(x) = \nabla_x\big(g^T(x)\big)f(x) + \nabla_x\big(f^T(x)\big)g(x)$.

• For symmetric Q, $d\big(f^T(x)Qf(x)\big) = f^T(x)Q\,df(x) + f^T(x)Q^T df(x) = 2f^T(x)Q\,df(x)$ and $\nabla_x\big(f^T(x)Qf(x)\big) = 2(df(x))^T Qf(x) = 2\nabla_x\big(f^T(x)\big)Qf(x)$.

• For symmetric Q, $d\big(x^T Qx\big) = 2x^T Q$ and $\nabla_x\big(x^T Qx\big) = 2Qx$.

• $d(\|f(x)\|^2) = 2f^T(x)\,df(x)$ and $\nabla_x\big(\|f(x)\|^2\big) = 2\nabla_x\big(f^T(x)\big)f(x)$.

A numerical sanity check of the quadratic-form identity appears below.
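A finite-difference sanity check of the identity ∇ₓ(xᵀQx) = 2Qx for symmetric Q (the random Q and x below are just illustrative):

```python
# Sketch: compare the analytic gradient 2Qx with a central-difference gradient.
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3))
Q = (A + A.T) / 2                      # make Q symmetric
x = rng.standard_normal(3)

f = lambda v: v @ Q @ v                # quadratic form x^T Q x
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)])

print(np.allclose(num_grad, 2 * Q @ x, atol=1e-5))   # True
```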

A.3.2 Integration

A.26. Basic Formulas

(a) $\int a^u\, du = \frac{a^u}{\ln a}$, $a > 0$, $a \ne 1$.

(b) $\int_0^1 t^\alpha\, dt = \begin{cases} \frac{1}{\alpha+1}, & \alpha > -1 \\ \infty, & \alpha \le -1 \end{cases}$ and $\int_1^\infty t^\alpha\, dt = \begin{cases} -\frac{1}{\alpha+1}, & \alpha < -1 \\ \infty, & \alpha \ge -1 \end{cases}$.
So, the integration of the function $\frac{1}{t}$ is the test case. In fact, $\int_0^1 \frac{1}{t}\, dt = \int_1^\infty \frac{1}{t}\, dt = \infty$.

(c) $\int x^m \ln x\, dx = \begin{cases} \frac{x^{m+1}}{m+1}\left(\ln x - \frac{1}{m+1}\right), & m \ne -1 \\ \frac{1}{2}\ln^2 x, & m = -1 \end{cases}$

A.27. Integration by Parts:
$$\int u\, dv = uv - \int v\, du$$



[Figure 25: Plots of $t^\alpha$ for $\alpha = 2, 1.5, 1, 0.5, 0, -0.5, -1$ on $t \in (0, 5]$.]

(a) Basic idea: Start with an integral of the form $\int f(x)g(x)\, dx$. Match this with an integral of the form $\int u\, dv$ by choosing dv to be part of the integrand including dx and possibly f(x) or g(x).

(b) In particular, repeated application of integration by parts gives
$$\int f(x)g(x)\, dx = f(x)G_1(x) + \sum_{i=1}^{n-1}(-1)^i f^{(i)}(x)G_{i+1}(x) + (-1)^n\int f^{(n)}(x)G_n(x)\, dx, \tag{51}$$
where $f^{(i)}(x) = \frac{d^i}{dx^i}f(x)$, $G_1(x) = \int g(x)\, dx$, and $G_{i+1}(x) = \int G_i(x)\, dx$. Figure 26 can be used to derive (51).

[Figure 26: Integration by Parts — tabular scheme: differentiate f(x) down one column ($f, f', \ldots, f^{(n+1)}$), integrate g(x) down the other ($g, G_1, \ldots, G_{n+1}$), and combine diagonal products with alternating signs.]

To see this, note that
$$\int f(x)g(x)\, dx = f(x)G_1(x) - \int f'(x)G_1(x)\, dx,$$
and
$$\int f^{(n)}(x)G_n(x)\, dx = f^{(n)}(x)G_{n+1}(x) - \int f^{(n+1)}(x)G_{n+1}(x)\, dx.$$


[Figure 27: Examples of Integration by Parts using Figure 26, e.g. $\int x^2 e^{3x}\, dx = \left(\frac{1}{3}x^2 - \frac{2}{9}x + \frac{2}{27}\right)e^{3x}$ and $\int e^x \sin x\, dx = \frac{1}{2}(\sin x - \cos x)e^x$.]

(c) $\int x^n e^{ax}\, dx = \frac{x^n e^{ax}}{a} - \frac{n}{a}\int x^{n-1}e^{ax}\, dx$

(d) $\int f(x)g'(x)\, dx = f(x)g(x) - \int f'(x)g(x)\, dx$.
To see this, start with the product rule: $(f(x)g(x))' = f(x)g'(x) + f'(x)g(x)$. Then, integrate both sides.

(e) $\int_a^b f(x)g'(x)\, dx = f(x)g(x)\big|_a^b - \int_a^b f'(x)g(x)\, dx$

A.28. If n is a positive integer,
$$\int x^n e^{ax}\, dx = \frac{e^{ax}}{a}\sum_{k=0}^n \frac{(-1)^k\, n!}{a^k (n-k)!}\, x^{n-k}.$$

(a) $n = 1$: $\frac{e^{ax}}{a}\left(x - \frac{1}{a}\right)$

(b) $n = 2$: $\frac{e^{ax}}{a}\left(x^2 - \frac{2}{a}x + \frac{2}{a^2}\right)$

(c) $\int_0^t x^n e^{ax}\, dx = \frac{e^{at}}{a}\sum_{k=0}^n \frac{(-1)^k n!}{a^k(n-k)!}\, t^{n-k} - \frac{(-1)^n n!}{a^{n+1}} = \frac{e^{at}}{a}\sum_{k=0}^n \frac{(-1)^k n!}{a^k(n-k)!}\, t^{n-k} + \frac{n!}{(-a)^{n+1}}$

(d) $\int_t^\infty x^n e^{-ax}\, dx = \frac{e^{-at}}{a}\sum_{k=0}^n \frac{n!}{a^k(n-k)!}\, t^{n-k} = \frac{e^{-at}}{a}\sum_{j=0}^n \frac{n!}{a^{n-j}\, j!}\, t^j$

(e) $\int_0^\infty x^n e^{ax}\, dx = \frac{n!}{(-a)^{n+1}}$, $a < 0$. (See also the Gamma function.)

• $n! = \int_0^\infty e^{-t} t^n\, dt$.

• In MATLAB, consider using gamma(n+1) instead of factorial(n). Note also that gamma() allows vector input.

(f) $\int_0^1 x^\beta e^{-x}\, dx$ is finite if and only if $\beta > -1$.
Note that $\frac{1}{e}\int_0^1 x^\beta\, dx \le \int_0^1 x^\beta e^{-x}\, dx \le \int_0^1 x^\beta\, dx$.


(g) $\forall \beta \in \mathbb{R}$, $\int_1^\infty x^\beta e^{-x}\, dx < \infty$.
For $\beta \le 0$, $\int_1^\infty x^\beta e^{-x}\, dx \le \int_1^\infty e^{-x}\, dx < \int_0^\infty e^{-x}\, dx = 1$.
For $\beta > 0$, $\int_1^\infty x^\beta e^{-x}\, dx \le \int_1^\infty x^{\lceil\beta\rceil} e^{-x}\, dx \le \int_0^\infty x^{\lceil\beta\rceil} e^{-x}\, dx = \lceil\beta\rceil!$

(h) $\int_0^\infty x^\beta e^{-x}\, dx$ is finite if and only if $\beta > -1$.

A.29 (Differential of integral). Leibniz's Rule: Let $g: \mathbb{R}^2 \to \mathbb{R}$, $a: \mathbb{R} \to \mathbb{R}$, and $b: \mathbb{R} \to \mathbb{R}$ be $C^1$. Then $f(x) = \int_{a(x)}^{b(x)} g(x, y)\, dy$ is $C^1$ and
$$f'(x) = b'(x)\, g(x, b(x)) - a'(x)\, g(x, a(x)) + \int_{a(x)}^{b(x)} \frac{\partial g}{\partial x}(x, y)\, dy. \tag{52}$$
In particular, we have
$$\frac{d}{dx}\int_a^x f(t)\, dt = f(x), \tag{53}$$
$$\frac{d}{dx}\int_a^{v(x)} f(t)\, dt = \frac{dv}{dx}\,\frac{d}{dv}\int_a^{v(x)} f(t)\, dt = f(v(x))\, v'(x), \tag{54}$$
$$\frac{d}{dx}\int_{u(x)}^{v(x)} f(t)\, dt = \frac{d}{dx}\left(\int_a^{v(x)} f(t)\, dt - \int_a^{u(x)} f(t)\, dt\right) = f(v(x))\, v'(x) - f(u(x))\, u'(x). \tag{55}$$
Note that (52) can be derived from (A.22) by considering $f(x) = h(a(x), b(x), x)$ where $h(a, b, c) = \int_a^b g(c, y)\, dy$ [9, p 318–319].
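A numerical check of Leibniz's rule (52); the particular g, a, and b below are illustrative choices, and SciPy's quad is used for the one-dimensional integrals:

```python
# Sketch: compare a finite-difference derivative of f(x) = int_{a(x)}^{b(x)} g(x,y) dy
# with the right-hand side of (52), for g(x,y) = exp(-x*y), a(x) = x, b(x) = x^2.
import numpy as np
from scipy.integrate import quad

g = lambda x, y: np.exp(-x * y)
dg_dx = lambda x, y: -y * np.exp(-x * y)      # partial derivative of g with respect to x
a, b = (lambda x: x), (lambda x: x**2)
f = lambda x: quad(lambda y: g(x, y), a(x), b(x))[0]

x0, eps = 1.3, 1e-5
numeric = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
leibniz = (2 * x0) * g(x0, b(x0)) - 1.0 * g(x0, a(x0)) \
          + quad(lambda y: dg_dx(x0, y), a(x0), b(x0))[0]   # b'(x)=2x, a'(x)=1
print(numeric, leibniz)   # the two values agree closely
```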

A.4 Gamma and Beta functions

A.30. Gamma function:

(a) $\Gamma(q) = \int_0^\infty x^{q-1} e^{-x}\, dx$, $q > 0$.

(b) $\Gamma(n) = (n-1)!$ for $n \in \mathbb{N}$; $\Gamma(n+1) = n!$ for $n \in \mathbb{N} \cup \{0\}$.

(c) 0! = 1.


(d) $\Gamma\!\left(\frac{1}{2}\right) = \sqrt{\pi}$.

(e) Γ (x+ 1) = xΓ (x) (Integration by parts).

• This relationship is used to define the gamma function for negative numbers.

(f) $\frac{\Gamma(q)}{\alpha^q} = \int_0^\infty x^{q-1} e^{-\alpha x}\, dx$, $\alpha > 0$.

[Figure 28: Plot of the gamma function Γ(x) for x ∈ (−6, 6).]

(g) By a limiting argument (as $q \to 0^+$), we have $\Gamma(0) = \infty$, which implies $\Gamma(-n) = \infty$ for $n \in \mathbb{N}$.
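A few of the gamma-function facts above can be checked numerically; this sketch uses scipy.special.gamma and scipy.integrate.quad, and the values of q and α are arbitrary choices:

```python
# Sketch: numerical checks of A.30 (b), (d), and (f).
import math
from scipy.special import gamma
from scipy.integrate import quad

print(gamma(5), math.factorial(4))            # Gamma(5) = 4! = 24
print(gamma(0.5), math.sqrt(math.pi))         # both ~ 1.7724538509

q, alpha = 2.5, 3.0
integral = quad(lambda x: x**(q - 1) * math.exp(-alpha * x), 0, math.inf)[0]
print(integral, gamma(q) / alpha**q)          # agree: Gamma(q)/alpha^q
```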

A.31. The incomplete beta function is defined as
$$B(x; a, b) = \int_0^x t^{a-1}(1-t)^{b-1}\, dt.$$
For x = 1, the incomplete beta function coincides with the (complete) beta function. The regularized incomplete beta function (or regularized beta function for short) is defined in terms of the incomplete beta function and the (complete) beta function:
$$I_x(a, b) = \frac{B(x; a, b)}{B(a, b)}.$$

• For integers m, k,
$$I_x(m, k) = \sum_{j=m}^{m+k-1} \frac{(m+k-1)!}{j!\,(m+k-1-j)!}\, x^j (1-x)^{m+k-1-j}.$$

• $I_0(a, b) = 0$, $I_1(a, b) = 1$.

• $I_x(a, b) = 1 - I_{1-x}(b, a)$.
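The integer-argument identity and the reflection identity above can be checked with SciPy's regularized incomplete beta function (the values of m, k, and x below are arbitrary):

```python
# Sketch: I_x(m, k) equals a binomial tail probability, and I_x(a,b) + I_{1-x}(b,a) = 1.
from math import comb
from scipy.special import betainc   # regularized incomplete beta function I_x(a, b)

m, k, x = 3, 5, 0.4
n = m + k - 1
binomial_tail = sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(m, n + 1))
print(betainc(m, k, x), binomial_tail)            # equal
print(betainc(m, k, x) + betainc(k, m, 1 - x))    # = 1
```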


References

[1] Patrick Billingsley. Probability and Measure. John Wiley & Sons, New York, 1995.5.30, 2b

[2] George Casella and Roger L. Berger. Statistical Inference. Duxbury Press, 2001. 4.43,6.9, 11, 8.13, 11.11, 11.4, 1, 6, 11.21

[3] Donald G. Childers. Probability And Random Processes Using MATLAB. McGraw-Hill, 1997. 1.3, 1.3

[4] Herbert A. David and H. N. Nagaraja. Order Statistics. Wiley-Interscience, 2003.11.4, 11.21

[5] W. Feller. An Introduction to Probability Theory and Its Applications, volume 2. JohnWiley & Sons, 1971. 4.37

[6] William Feller. An Introduction to Probability Theory and Its Applications, Volume 1.Wiley, 3 edition, 1968.

[7] Terrence L. Fine. Probability and Probabilistic Reasoning for Electrical Engineering.Prentice Hall, 2005. 3.1, 3.10, 3.14, 4.18, 4.24, 5.27, 9.1, 9.3, 11.10, 15.2, A.16

[8] B. V. Gnedenko. Theory of probability. Chelsea Pub. Co., New York, 4 edition, 1967.Translated from the Russian by B.D. Seckler. 5.30

[9] John A. Gubner. Probability and Random Processes for Electrical and Computer En-gineers. Cambridge University Press, 2006. 1.2, 4.31, 4.34, 4.36, 4, 6.45, 7.9, 9.9, 10.1,2, 3, 6, 11.24, 12.22, 12.23, 14.3, 3, A.6, A.13, A.29

[10] M.O. Jones and R.F. Serfozo. Poisson limits of sums of point processes and a particle-survivor model. Ann. Appl. Probab, 17(1):265–283, 2007. 12.10

[11] Samuel Karlin and Howard E. Taylor. A First Course in Stochastic Processes. Aca-demic Press, 1975. 4.6, 9.10, 10.1

[12] J. F. C. Kingman. Poisson Processes. Oxford University Press, 1993. ISBN:0198536933. 5.24, 5.25

[13] A.N. Kolmogorov. The Foundations of Probability. 1933. 3.14

[14] Nabendu Pal, Chun Jin, and Wooi K. Lim. Handbook of Exponential and RelatedDistributions for Engineers and Scientists. Chapman & Hall/CRC, 2005. 6.25

[15] Athanasios Papoulis. Probability, Random Variables and Stochastic Processes.McGraw-Hill Companies, 1991. 5.18, 11.5

[16] E. Parzen. Stochastic Processes. Holden Day, 1962. 3, 4, 11.25

[17] Sidney I. Resnick. Adventures in Stochastic Processes. Birkhäuser Boston, 1992. 5


[18] Kenneth H. Rosen, editor. Handbook of Discrete and Combinatorial Mathematics.CRC, 1999. 1, 4, 5, 6, 7, 8

[19] Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill, 1976. A.6

[20] Gábor J. Székely. Paradoxes in Probability Theory and Mathematical Statistics. 1986. 8.2

[21] George B. Thomas, Maurice D. Weir, Joel Hass, and Frank R. Giordano. Thomas’Calculus. Addison Wesley, 2004. A.16

[22] Tung. Fundamental of Probability and Statistics for Reliability Analysis. 2005. 17, 20

[23] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference.Springer, 2004. 4.16, 4.18, 4.44, 7.9, 3, 13.1



Index

Bayes Theorem, 22
Binomial theorem, 12
Birthday Paradox, 24
Cauchy-Bunyakovskii-Schwartz Inequality, 81
Chevalier de Méré's Scandal of Arithmetic, 23
Coefficient of Variation (CV), 74
Delta function, 19
Dice, 23
Die, 23
Dirac delta function, see Delta function
Event Algebra, 25
False Positives on Diagnostic Tests, 24
Fano Factor, 74
gradient, 134
gradient vector field, 134
Hölder's Inequality, 80
Hessian matrix, 136
Integration by Parts, 137
Isserlis's Theorem, 123
Jacobian, 100, 134
Jacobian formulas, 101
Jensen's Inequality, 81
Leibniz's Rule, 140
Lyapounov's Inequality, 81
Markov's Inequality, 78
Minkowski's Inequality, 81
Monte Hall's Game, 24
multinomial coefficient, 13
Multinomial Counting, 13
Multinomial Theorem, 15
Order Statistics, 103
Probability of coincidence birthday, 24
Standard Deviation, 74
Total Probability Theorem, 22
uncorrelated but not independent, 77, 95
Zeta function, 133
Zipf or zeta random variable, 60
