
Probability for Machine Learning

by Prof. Seungchul Lee
iSystems Design Lab
http://isystems.unist.ac.kr/
UNIST

Table of Contents

1. Random Variable (= r.v.)
   - Expectation = mean
2. Random Vectors (multivariate R.V.)
   - 2.1. Joint density probability
   - 2.2. Marginal density probability
   - 2.3. Conditional probability
3. Bayes Rule
4. Linear Transformation of Random Variables
   - 4.1. For single random variable
   - 4.2. Sum of two random variables X and Y
   - 4.3. Affine transformation of random vectors

1. Random Variable (= r.v.)

(Rough) Definition: a variable whose value is governed by a probability distribution

Probability that $x = a$

$$\triangleq P_X(x = a) = P(x = a) \implies \begin{cases} 1) \; P(x = a) \geq 0 \\ 2) \; \sum_{\text{all } a} P(x = a) = 1 \end{cases}$$

$$\begin{cases} \text{continuous r.v.} & \text{if } x \text{ is continuous} \\ \text{discrete r.v.} & \text{if } x \text{ is discrete} \end{cases}$$


Example

$x$: die outcome

$$P(X = 1) = P(X = 2) = \cdots = P(X = 6) = \frac{1}{6}$$

Question

$y = x_1 + x_2$: sum of two dice

$$P_Y(y = 5) = \; ?$$

Expectation = mean

$$E[x] = \begin{cases} \sum_x x P(x) & \text{discrete} \\ \int_x x P(x)\,dx & \text{continuous} \end{cases}$$

Example

Sample mean: $E[x] = \sum_x x \cdot \frac{1}{m}$ $(\because \text{uniform distribution assumed})$

Variance: $\text{var}[x] = E\left[(x - E[x])^2\right]$ : mean square deviation from the mean
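As a quick check on the question above, the exact answer is $P_Y(y = 5) = \frac{4}{36} = \frac{1}{9}$, since four of the 36 equally likely pairs sum to 5. Below is a minimal simulation sketch (not part of the original notes; numpy and the seed are my choices):

In [ ]:

import numpy as np

rng = np.random.default_rng(0)
m = 100_000

# Two fair dice, m trials each
x1 = rng.integers(1, 7, size=m)
x2 = rng.integers(1, 7, size=m)
y = x1 + x2

# Empirical P_Y(y = 5); exact value is 4/36 = 1/9
print("P(y = 5) ~", np.mean(y == 5))

# Sample mean and variance of a single die outcome
print("E[x1]  ~", x1.mean())   # exact: 3.5
print("var[x1] ~", x1.var())   # exact: 35/12 ~ 2.917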


2. Random Vectors (multivariate R.V.)

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad n \text{ random variables}$$

2.1. Joint density probability

Joint density probability models the probability of co-occurrence of many r.v.:

$$P_{X_1, \cdots, X_n}(X_1 = x_1, \cdots, X_n = x_n)$$


2.2. Marginal density probability

$$P_{X_1}(X_1 = x_1), \;\; \cdots, \;\; P_{X_n}(X_n = x_n)$$

For two r.v.:

$$P(X) = \sum_y P(X, Y = y)$$

$$P(Y) = \sum_x P(X = x, Y)$$

2.3. Conditional probability

Probability of one event when we know the outcome of the other:

$$P_{X_1 \mid X_2}(X_1 = x_1 \mid X_2 = x_2) = \frac{P(X_1 = x_1, X_2 = x_2)}{P(X_2 = x_2)} \; : \; \text{conditional prob. of } x_1 \text{ given } x_2$$


Independent random variables: when one tells nothing about the other.

$$P(X_1 = x_1 \mid X_2 = x_2) = P(X_1 = x_1)$$

$$\Updownarrow$$

$$P(X_2 = x_2 \mid X_1 = x_1) = P(X_2 = x_2)$$

$$\Updownarrow$$

$$P(X_1 = x_1, X_2 = x_2) = P(X_1 = x_1)\,P(X_2 = x_2)$$

Example

Four dice $\omega_1, \omega_2, \omega_3, \omega_4$:

$$x = \omega_1 + \omega_2 : \text{ sum of the first two dice}$$

$$y = \omega_1 + \omega_2 + \omega_3 + \omega_4 : \text{ sum of all four dice}$$

Probability of $\begin{bmatrix} x \\ y \end{bmatrix} = \; ?$


Marginal probability:

$$P_X(x) = \sum_y P_{XY}(x, y)$$


Conditional probability: suppose we measured $y = 19$.

$$P_{X \mid Y}(x \mid y = 19) = \; ?$$
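The joint, marginal, and conditional tables for this four-dice example can be built by brute-force enumeration over all $6^4$ equally likely outcomes. A minimal sketch, not from the original notes:

In [ ]:

import itertools
from collections import Counter
from fractions import Fraction

# Enumerate all 6^4 = 1296 equally likely outcomes of four dice
joint = Counter()
for w1, w2, w3, w4 in itertools.product(range(1, 7), repeat=4):
    x = w1 + w2              # sum of the first two dice
    y = w1 + w2 + w3 + w4    # sum of all four dice
    joint[(x, y)] += 1

total = 6 ** 4

# Marginal P_X(x) = sum_y P_XY(x, y)
P_x = Counter()
for (x, y), c in joint.items():
    P_x[x] += Fraction(c, total)

# Conditional P_{X|Y}(x | y = 19): renormalize the y = 19 slice
c19 = sum(c for (x, y), c in joint.items() if y == 19)
P_x_given_y19 = {x: Fraction(c, c19) for (x, y), c in joint.items() if y == 19}

print(P_x[7])                          # marginal: 6/36 = 1/6
print(sorted(P_x_given_y19.items()))   # conditional distribution of x given y = 19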


Pictorial Explanation


Example

Suppose we have three bins, labeled A, B, and C. Two of the bins have only white balls, and one bin has only black balls.

1) We take one ball; what is the probability that it is white? (white = 1)

2) When a white ball has been drawn from bin C, what is the probability of drawing a white ball from bin B?

3) When two balls have been drawn from two different bins, what is the probability of drawing two white balls?

$$P(X_1 = 1) = \frac{2}{3}$$

$$P(X_2 = 1 \mid X_1 = 1) = \frac{1}{2}$$

$$P(X_1 = 1, X_2 = 1) = P(X_2 = 1 \mid X_1 = 1)\,P(X_1 = 1) = \frac{1}{2} \cdot \frac{2}{3} = \frac{1}{3}$$

3. Bayes Rule

Bayes rule enables us to swap $X_1$ and $X_2$ in conditional probability:

$$P(X_2, X_1) = P(X_2 \mid X_1)\,P(X_1) = P(X_1 \mid X_2)\,P(X_2)$$

$$\therefore \; P(X_2 \mid X_1) = \frac{P(X_1 \mid X_2)\,P(X_2)}{P(X_1)}$$
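The three answers above can be sanity-checked by treating the identity of the all-black bin as a uniform prior over A, B, and C. A sketch under that modeling assumption (not from the original notes):

In [ ]:

from fractions import Fraction
from itertools import combinations

bins = ["A", "B", "C"]
# The all-black bin is A, B, or C with equal prior probability 1/3;
# the other two bins contain only white balls.
prior = {b: Fraction(1, 3) for b in bins}

# 1) Draw one ball from a uniformly chosen bin: white iff the bin is not black
p_white = sum(prior[black] * Fraction(2, 3) for black in bins)
print(p_white)  # 2/3

# 2) A white ball came from bin C => C is not the black bin.
#    Condition the prior on that, then P(white from B) = P(black bin != B)
posterior = {b: prior[b] for b in bins if b != "C"}
z = sum(posterior.values())
posterior = {b: p / z for b, p in posterior.items()}
print(1 - posterior["B"])  # 1/2

# 3) Two balls from two different (uniformly chosen) bins:
#    both white iff the black bin is not among the chosen pair
pairs = list(combinations(bins, 2))
p_two_white = sum(Fraction(1, len(pairs)) *
                  sum(prior[black] for black in bins if black not in pair)
                  for pair in pairs)
print(p_two_white)  # 1/3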


Example

Suppose that in a group of people, 40% are male and 60% are female. 50% of the males are smokers, and 30% of the females are smokers. Find the probability that a smoker is male.

$$x = \text{M or F}, \qquad y = \text{S or N}$$

$$P(x = \text{M}) = 0.4, \quad P(x = \text{F}) = 0.6$$

$$P(y = \text{S} \mid x = \text{M}) = 0.5, \quad P(y = \text{S} \mid x = \text{F}) = 0.3$$

$$P(x = \text{M} \mid y = \text{S}) = \; ?$$

By the law of total probability,

$$P(y = \text{S}) = P(y = \text{S} \mid x = \text{M})\,P(x = \text{M}) + P(y = \text{S} \mid x = \text{F})\,P(x = \text{F}) = 0.5 \times 0.4 + 0.3 \times 0.6 = 0.38$$

$$P(x = \text{M} \mid y = \text{S}) = \frac{P(y = \text{S} \mid x = \text{M})\,P(x = \text{M})}{P(y = \text{S})} = \frac{0.20}{0.38} \approx 0.53$$
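The same computation in code; a minimal sketch with variable names of my choosing:

In [ ]:

# Given quantities from the example
P_M = 0.4            # P(x = M)
P_F = 0.6            # P(x = F)
P_S_given_M = 0.5    # P(y = S | x = M)
P_S_given_F = 0.3    # P(y = S | x = F)

# Law of total probability: P(y = S)
P_S = P_S_given_M * P_M + P_S_given_F * P_F

# Bayes rule: P(x = M | y = S)
P_M_given_S = P_S_given_M * P_M / P_S
print(P_S, P_M_given_S)   # 0.38, ~0.526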

Bayes' Rule + conditional probability

Example

$$\left(\frac{9}{10}\right)^{20} \approx 0.1216$$

History

4. Linear Transformation of Random Variables


4.1. For single random variable

$$X \mapsto Y = aX$$

$$E[aX] = aE[X], \qquad \text{var}(aX) = a^2\,\text{var}(X)$$

Writing $\bar{X} = E[X]$ and $\bar{Y} = E[Y]$:

$$\begin{aligned} \text{var}(X) &= E\left[(X - E[X])^2\right] = E\left[(X - \bar{X})^2\right] = E\left[X^2 - 2X\bar{X} + \bar{X}^2\right] \\ &= E[X^2] - 2E[X]\bar{X} + \bar{X}^2 \\ &= E[X^2] - E[X]^2 \end{aligned}$$

4.2. Sum of two random variables $X$ and $Y$

Note: quality control in a manufacturing process

$$Z = X + Y \quad \text{(still univariate)}$$

$$E[X + Y] = E[X] + E[Y]$$

$$\begin{aligned} \text{var}(X + Y) &= E\left[(X + Y - E[X + Y])^2\right] = E\left[\left((X - \bar{X}) + (Y - \bar{Y})\right)^2\right] \\ &= E\left[(X - \bar{X})^2\right] + E\left[(Y - \bar{Y})^2\right] + 2E\left[(X - \bar{X})(Y - \bar{Y})\right] \\ &= \text{var}(X) + \text{var}(Y) + 2\,\text{cov}(X, Y) \end{aligned}$$

$$\begin{aligned} \text{cov}(X, Y) &= E\left[(X - \bar{X})(Y - \bar{Y})\right] = E\left[XY - X\bar{Y} - Y\bar{X} + \bar{X}\bar{Y}\right] \\ &= E[XY] - E[X]\bar{Y} - E[Y]\bar{X} + \bar{X}\bar{Y} = E[XY] - E[X]E[Y] \end{aligned}$$

$$\therefore \; \text{var}(X + Y) = \text{var}(X) + \text{var}(Y) + 2\,\text{cov}(X, Y)$$
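The identity $\text{var}(X + Y) = \text{var}(X) + \text{var}(Y) + 2\,\text{cov}(X, Y)$ is easy to verify numerically on correlated samples. A sketch with an arbitrary, made-up dependence between $X$ and $Y$:

In [ ]:

import numpy as np

rng = np.random.default_rng(1)
m = 200_000

# Two correlated random variables: Y depends partly on X
X = rng.normal(0, 1, size=m)
Y = 0.5 * X + rng.normal(0, 1, size=m)

lhs = np.var(X + Y)
# bias=True normalizes by N, matching np.var's default
rhs = np.var(X) + np.var(Y) + 2 * np.cov(X, Y, bias=True)[0, 1]
print(lhs, rhs)   # the two numbers agree up to sampling noise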


Remark

- variance: for a single r.v. (univariate)
- covariance: for a pair of r.v. (bivariate)

Covariance of two r.v.:

$$\text{cov}(x, y) = E[(x - \mu_x)(y - \mu_y)]$$

Covariance matrix for random vectors:

$$\text{cov}(X) = E\left[(X - \mu)(X - \mu)^T\right] = \begin{bmatrix} \text{cov}(X_1, X_1) & \text{cov}(X_1, X_2) \\ \text{cov}(X_2, X_1) & \text{cov}(X_2, X_2) \end{bmatrix} = \begin{bmatrix} \text{var}(X_1) & \text{cov}(X_1, X_2) \\ \text{cov}(X_2, X_1) & \text{var}(X_2) \end{bmatrix}$$

Moments: provide rough clues on the probability distribution

$$\int x^k P_x(x)\,dx \quad \text{or} \quad \sum x^k P_x(x)$$

4.3. Affine transformation of random vectors

$$y = Ax + b$$

1. $E[y] = AE[x] + b$
2. $\text{cov}(y) = A\,\text{cov}(x)\,A^T$

IID random variables:

$$\begin{cases} \text{independent} \\ \text{identically distributed} \end{cases}$$

Suppose $x_1, x_2, \cdots, x_m$ are IID with mean $\mu$ and variance $\sigma^2$.

$$\text{Let } x = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}, \text{ then } E[x] = \begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix}, \quad \text{cov}(x) = \begin{bmatrix} \sigma^2 & & \\ & \ddots & \\ & & \sigma^2 \end{bmatrix}$$
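Both affine-transformation rules can be checked empirically. A sketch where the particular $A$, $b$, and distribution are arbitrary choices, not from the notes:

In [ ]:

import numpy as np

rng = np.random.default_rng(2)
m = 500_000

# x: 3-dimensional random vector, IID components with mean 1, variance 4
x = rng.normal(1.0, 2.0, size=(3, m))

A = np.array([[1.0, 2.0,  0.0],
              [0.0, 1.0, -1.0]])
b = np.array([[0.5], [1.0]])

y = A @ x + b

# Rule 1: E[y] = A E[x] + b
print(y.mean(axis=1))              # ~ A @ [1, 1, 1] + b = [3.5, 1.0]

# Rule 2: cov(y) = A cov(x) A^T, with cov(x) = 4 I
print(np.cov(y))                   # empirical 2x2 covariance
print(A @ (4 * np.eye(3)) @ A.T)   # exact: [[20, 8], [8, 8]]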


Sum of IID random variables ($\implies$ single r.v.)

$$S_m = \frac{1}{m}\sum_{i=1}^{m} x_i \implies S_m = Ax \;\text{ where } A = \frac{1}{m}\begin{bmatrix} 1 & \cdots & 1 \end{bmatrix}$$

$$E[S_m] = AE[x] = \frac{1}{m}\begin{bmatrix} 1 & \cdots & 1 \end{bmatrix}\begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix} = \frac{1}{m}\,m\mu = \mu$$

$$\text{var}(S_m) = A\,\text{cov}(x)\,A^T = \frac{1}{m}\begin{bmatrix} 1 & \cdots & 1 \end{bmatrix}\begin{bmatrix} \sigma^2 & & \\ & \ddots & \\ & & \sigma^2 \end{bmatrix}\frac{1}{m}\begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} = \frac{\sigma^2}{m}$$

Averaging $m$ IID samples reduces the variance by a factor of $m$ $\implies$ Law of large numbers and Central limit theorem:

$$\bar{x} \longrightarrow N\!\left(\mu, \left(\frac{\sigma}{\sqrt{m}}\right)^2\right)$$

In [1]:

%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')
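A quick empirical illustration of the $\sigma^2/m$ variance reduction, with arbitrary $\mu$, $\sigma$, and $m$ (not from the original notes):

In [ ]:

import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 2.0, 3.0
m, trials = 100, 50_000

# Each row is one experiment of m IID samples; S_m is the row mean
x = rng.normal(mu, sigma, size=(trials, m))
S_m = x.mean(axis=1)

print(S_m.mean())        # ~ mu
print(S_m.var())         # ~ sigma**2 / m
print(sigma**2 / m)      # exact: 0.09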