
Probability for Machine Learning

by Prof. Seungchul Lee
iSystems Design Lab
http://isystems.unist.ac.kr/
UNIST

Table of Contents

1. Random Variable (= r.v.)
   - Expectation = mean
2. Random Vectors (multivariate R.V.)
   - 2.1. Joint density probability
   - 2.2. Marginal density probability
   - 2.3. Conditional probability
3. Bayes Rule
4. Linear Transformation of Random Variables
   - 4.1. For single random variable
   - 4.2. Sum of two random variables X and Y
   - 4.3. Affine transformation of random vectors

1. Random Variable (= r.v.)

(Rough) Definition: a variable whose value is governed by a probability distribution

Probability that $x = a$

$$\triangleq P_X(x = a) = P(x = a) \implies \begin{cases} 1) \; P(x = a) \geq 0 \\ 2) \; \sum_{\text{all } a} P(x = a) = 1 \end{cases}$$

$$\begin{cases} \text{continuous r.v.} & \text{if } x \text{ is continuous} \\ \text{discrete r.v.} & \text{if } x \text{ is discrete} \end{cases}$$


Example

$x$: die outcome

$$P(X = 1) = P(X = 2) = \cdots = P(X = 6) = \frac{1}{6}$$

Question

$y = x_1 + x_2$: sum of two dice

$$P_Y(y = 5) = \; ?$$

Expectation = mean

$$E[x] = \begin{cases} \sum_x x P(x) & \text{discrete} \\ \int_x x P(x)\,dx & \text{continuous} \end{cases}$$

Example

Sample mean: $E[x] = \sum_x x \cdot \frac{1}{m}$ $(\because \text{uniform distribution assumed})$

Variance: $\text{var}[x] = E\left[(x - E[x])^2\right]$ : mean square deviation from the mean
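As a quick check on the question above, the exact answer is $P_Y(y = 5) = \frac{4}{36} = \frac{1}{9}$, since four of the 36 equally likely pairs sum to 5. Below is a minimal simulation sketch (not part of the original notes; numpy and the seed are my choices):

In [ ]:

import numpy as np

rng = np.random.default_rng(0)
m = 100_000

# Two fair dice, m trials each
x1 = rng.integers(1, 7, size=m)
x2 = rng.integers(1, 7, size=m)
y = x1 + x2

# Empirical P_Y(y = 5); exact value is 4/36 = 1/9
print("P(y = 5) ~", np.mean(y == 5))

# Sample mean and variance of a single die outcome
print("E[x1]  ~", x1.mean())   # exact: 3.5
print("var[x1] ~", x1.var())   # exact: 35/12 ~ 2.917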


2. Random Vectors (multivariate R.V.)

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad n \text{ random variables}$$

2.1. Joint density probability

Joint density probability models the probability of co-occurrence of many r.v.:

$$P_{X_1, \cdots, X_n}(X_1 = x_1, \cdots, X_n = x_n)$$


2.2. Marginal density probability

$$P_{X_1}(X_1 = x_1), \;\; \cdots, \;\; P_{X_n}(X_n = x_n)$$

For two r.v.:

$$P(X) = \sum_y P(X, Y = y)$$

$$P(Y) = \sum_x P(X = x, Y)$$

2.3. Conditional probability

Probability of one event when we know the outcome of the other:

$$P_{X_1 \mid X_2}(X_1 = x_1 \mid X_2 = x_2) = \frac{P(X_1 = x_1, X_2 = x_2)}{P(X_2 = x_2)} \; : \; \text{conditional prob. of } x_1 \text{ given } x_2$$


Independent random variables: when one tells nothing about the other.

$$P(X_1 = x_1 \mid X_2 = x_2) = P(X_1 = x_1)$$

$$\Updownarrow$$

$$P(X_2 = x_2 \mid X_1 = x_1) = P(X_2 = x_2)$$

$$\Updownarrow$$

$$P(X_1 = x_1, X_2 = x_2) = P(X_1 = x_1)\,P(X_2 = x_2)$$

Example

Four dice $\omega_1, \omega_2, \omega_3, \omega_4$:

$$x = \omega_1 + \omega_2 : \text{ sum of the first two dice}$$

$$y = \omega_1 + \omega_2 + \omega_3 + \omega_4 : \text{ sum of all four dice}$$

Probability of $\begin{bmatrix} x \\ y \end{bmatrix} = \; ?$


Marginal probability:

$$P_X(x) = \sum_y P_{XY}(x, y)$$


Conditional probability: suppose we measured $y = 19$.

$$P_{X \mid Y}(x \mid y = 19) = \; ?$$
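The joint, marginal, and conditional tables for this four-dice example can be built by brute-force enumeration over all $6^4$ equally likely outcomes. A minimal sketch, not from the original notes:

In [ ]:

import itertools
from collections import Counter
from fractions import Fraction

# Enumerate all 6^4 = 1296 equally likely outcomes of four dice
joint = Counter()
for w1, w2, w3, w4 in itertools.product(range(1, 7), repeat=4):
    x = w1 + w2              # sum of the first two dice
    y = w1 + w2 + w3 + w4    # sum of all four dice
    joint[(x, y)] += 1

total = 6 ** 4

# Marginal P_X(x) = sum_y P_XY(x, y)
P_x = Counter()
for (x, y), c in joint.items():
    P_x[x] += Fraction(c, total)

# Conditional P_{X|Y}(x | y = 19): renormalize the y = 19 slice
c19 = sum(c for (x, y), c in joint.items() if y == 19)
P_x_given_y19 = {x: Fraction(c, c19) for (x, y), c in joint.items() if y == 19}

print(P_x[7])                          # marginal: 6/36 = 1/6
print(sorted(P_x_given_y19.items()))   # conditional distribution of x given y = 19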


Pictorial Explanation


Example

Suppose we have three bins, labeled A, B, and C. Two of the bins have only white balls, and one bin has only black balls.

1) We take one ball; what is the probability that it is white? (white = 1)

2) When a white ball has been drawn from bin C, what is the probability of drawing a white ball from bin B?

3) When two balls have been drawn from two different bins, what is the probability of drawing two white balls?

$$P(X_1 = 1) = \frac{2}{3}$$

$$P(X_2 = 1 \mid X_1 = 1) = \frac{1}{2}$$

$$P(X_1 = 1, X_2 = 1) = P(X_2 = 1 \mid X_1 = 1)\,P(X_1 = 1) = \frac{1}{2} \cdot \frac{2}{3} = \frac{1}{3}$$

3. Bayes Rule

Bayes rule enables us to swap $X_1$ and $X_2$ in conditional probability:

$$P(X_2, X_1) = P(X_2 \mid X_1)\,P(X_1) = P(X_1 \mid X_2)\,P(X_2)$$

$$\therefore \; P(X_2 \mid X_1) = \frac{P(X_1 \mid X_2)\,P(X_2)}{P(X_1)}$$
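The three answers above can be sanity-checked by treating the identity of the all-black bin as a uniform prior over A, B, and C. A sketch under that modeling assumption (not from the original notes):

In [ ]:

from fractions import Fraction
from itertools import combinations

bins = ["A", "B", "C"]
# The all-black bin is A, B, or C with equal prior probability 1/3;
# the other two bins contain only white balls.
prior = {b: Fraction(1, 3) for b in bins}

# 1) Draw one ball from a uniformly chosen bin: white iff the bin is not black
p_white = sum(prior[black] * Fraction(2, 3) for black in bins)
print(p_white)  # 2/3

# 2) A white ball came from bin C => C is not the black bin.
#    Condition the prior on that, then P(white from B) = P(black bin != B)
posterior = {b: prior[b] for b in bins if b != "C"}
z = sum(posterior.values())
posterior = {b: p / z for b, p in posterior.items()}
print(1 - posterior["B"])  # 1/2

# 3) Two balls from two different (uniformly chosen) bins:
#    both white iff the black bin is not among the chosen pair
pairs = list(combinations(bins, 2))
p_two_white = sum(Fraction(1, len(pairs)) *
                  sum(prior[black] for black in bins if black not in pair)
                  for pair in pairs)
print(p_two_white)  # 1/3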


Example

Suppose that in a group of people, 40% are male and 60% are female. 50% of the males are smokers, and 30% of the females are smokers. Find the probability that a smoker is male.

$$x = \text{M or F}, \qquad y = \text{S or N}$$

$$P(x = \text{M}) = 0.4, \quad P(x = \text{F}) = 0.6$$

$$P(y = \text{S} \mid x = \text{M}) = 0.5, \quad P(y = \text{S} \mid x = \text{F}) = 0.3$$

$$P(x = \text{M} \mid y = \text{S}) = \; ?$$

By the law of total probability,

$$P(y = \text{S}) = P(y = \text{S} \mid x = \text{M})\,P(x = \text{M}) + P(y = \text{S} \mid x = \text{F})\,P(x = \text{F}) = 0.5 \times 0.4 + 0.3 \times 0.6 = 0.38$$

$$P(x = \text{M} \mid y = \text{S}) = \frac{P(y = \text{S} \mid x = \text{M})\,P(x = \text{M})}{P(y = \text{S})} = \frac{0.20}{0.38} \approx 0.53$$
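The same computation in code; a minimal sketch with variable names of my choosing:

In [ ]:

# Given quantities from the example
P_M = 0.4            # P(x = M)
P_F = 0.6            # P(x = F)
P_S_given_M = 0.5    # P(y = S | x = M)
P_S_given_F = 0.3    # P(y = S | x = F)

# Law of total probability: P(y = S)
P_S = P_S_given_M * P_M + P_S_given_F * P_F

# Bayes rule: P(x = M | y = S)
P_M_given_S = P_S_given_M * P_M / P_S
print(P_S, P_M_given_S)   # 0.38, ~0.526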

Bayes' Rule + conditional probability

Example

$$\left(\frac{9}{10}\right)^{20} \approx 0.1216$$

History

4. Linear Transformation of Random Variables


4.1. For single random variable

$$X \mapsto Y = aX$$

$$E[aX] = aE[X], \qquad \text{var}(aX) = a^2\,\text{var}(X)$$

Writing $\bar{X} = E[X]$ and $\bar{Y} = E[Y]$:

$$\begin{aligned} \text{var}(X) &= E\left[(X - E[X])^2\right] = E\left[(X - \bar{X})^2\right] = E\left[X^2 - 2X\bar{X} + \bar{X}^2\right] \\ &= E[X^2] - 2E[X]\bar{X} + \bar{X}^2 \\ &= E[X^2] - E[X]^2 \end{aligned}$$

4.2. Sum of two random variables $X$ and $Y$

Note: quality control in a manufacturing process

$$Z = X + Y \quad \text{(still univariate)}$$

$$E[X + Y] = E[X] + E[Y]$$

$$\begin{aligned} \text{var}(X + Y) &= E\left[(X + Y - E[X + Y])^2\right] = E\left[\left((X - \bar{X}) + (Y - \bar{Y})\right)^2\right] \\ &= E\left[(X - \bar{X})^2\right] + E\left[(Y - \bar{Y})^2\right] + 2E\left[(X - \bar{X})(Y - \bar{Y})\right] \\ &= \text{var}(X) + \text{var}(Y) + 2\,\text{cov}(X, Y) \end{aligned}$$

$$\begin{aligned} \text{cov}(X, Y) &= E\left[(X - \bar{X})(Y - \bar{Y})\right] = E\left[XY - X\bar{Y} - Y\bar{X} + \bar{X}\bar{Y}\right] \\ &= E[XY] - E[X]\bar{Y} - E[Y]\bar{X} + \bar{X}\bar{Y} = E[XY] - E[X]E[Y] \end{aligned}$$

$$\therefore \; \text{var}(X + Y) = \text{var}(X) + \text{var}(Y) + 2\,\text{cov}(X, Y)$$
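The identity $\text{var}(X + Y) = \text{var}(X) + \text{var}(Y) + 2\,\text{cov}(X, Y)$ is easy to verify numerically on correlated samples. A sketch with an arbitrary, made-up dependence between $X$ and $Y$:

In [ ]:

import numpy as np

rng = np.random.default_rng(1)
m = 200_000

# Two correlated random variables: Y depends partly on X
X = rng.normal(0, 1, size=m)
Y = 0.5 * X + rng.normal(0, 1, size=m)

lhs = np.var(X + Y)
# bias=True normalizes by N, matching np.var's default
rhs = np.var(X) + np.var(Y) + 2 * np.cov(X, Y, bias=True)[0, 1]
print(lhs, rhs)   # the two numbers agree up to sampling noise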


Remark

- variance: for a single r.v. (univariate)
- covariance: for a pair of r.v. (bivariate)

Covariance of two r.v.:

$$\text{cov}(x, y) = E[(x - \mu_x)(y - \mu_y)]$$

Covariance matrix for random vectors:

$$\text{cov}(X) = E\left[(X - \mu)(X - \mu)^T\right] = \begin{bmatrix} \text{cov}(X_1, X_1) & \text{cov}(X_1, X_2) \\ \text{cov}(X_2, X_1) & \text{cov}(X_2, X_2) \end{bmatrix} = \begin{bmatrix} \text{var}(X_1) & \text{cov}(X_1, X_2) \\ \text{cov}(X_2, X_1) & \text{var}(X_2) \end{bmatrix}$$

Moments: provide rough clues on the probability distribution

$$\int x^k P_x(x)\,dx \quad \text{or} \quad \sum x^k P_x(x)$$

4.3. Affine transformation of random vectors

$$y = Ax + b$$

1. $E[y] = AE[x] + b$
2. $\text{cov}(y) = A\,\text{cov}(x)\,A^T$

IID random variables:

$$\begin{cases} \text{independent} \\ \text{identically distributed} \end{cases}$$

Suppose $x_1, x_2, \cdots, x_m$ are IID with mean $\mu$ and variance $\sigma^2$.

$$\text{Let } x = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}, \text{ then } E[x] = \begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix}, \quad \text{cov}(x) = \begin{bmatrix} \sigma^2 & & \\ & \ddots & \\ & & \sigma^2 \end{bmatrix}$$
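Both affine-transformation rules can be checked empirically. A sketch where the particular $A$, $b$, and distribution are arbitrary choices, not from the notes:

In [ ]:

import numpy as np

rng = np.random.default_rng(2)
m = 500_000

# x: 3-dimensional random vector, IID components with mean 1, variance 4
x = rng.normal(1.0, 2.0, size=(3, m))

A = np.array([[1.0, 2.0,  0.0],
              [0.0, 1.0, -1.0]])
b = np.array([[0.5], [1.0]])

y = A @ x + b

# Rule 1: E[y] = A E[x] + b
print(y.mean(axis=1))              # ~ A @ [1, 1, 1] + b = [3.5, 1.0]

# Rule 2: cov(y) = A cov(x) A^T, with cov(x) = 4 I
print(np.cov(y))                   # empirical 2x2 covariance
print(A @ (4 * np.eye(3)) @ A.T)   # exact: [[20, 8], [8, 8]]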


Sum of IID random variables ($\implies$ single r.v.)

$$S_m = \frac{1}{m}\sum_{i=1}^{m} x_i \implies S_m = Ax \;\text{ where } A = \frac{1}{m}\begin{bmatrix} 1 & \cdots & 1 \end{bmatrix}$$

$$E[S_m] = AE[x] = \frac{1}{m}\begin{bmatrix} 1 & \cdots & 1 \end{bmatrix}\begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix} = \frac{1}{m}\,m\mu = \mu$$

$$\text{var}(S_m) = A\,\text{cov}(x)\,A^T = \frac{1}{m}\begin{bmatrix} 1 & \cdots & 1 \end{bmatrix}\begin{bmatrix} \sigma^2 & & \\ & \ddots & \\ & & \sigma^2 \end{bmatrix}\frac{1}{m}\begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} = \frac{\sigma^2}{m}$$

Averaging $m$ IID samples reduces the variance by a factor of $m$ $\implies$ Law of large numbers and Central limit theorem:

$$\bar{x} \longrightarrow N\!\left(\mu, \left(\frac{\sigma}{\sqrt{m}}\right)^2\right)$$

In [1]:

%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')
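A quick empirical illustration of the $\sigma^2/m$ variance reduction, with arbitrary $\mu$, $\sigma$, and $m$ (not from the original notes):

In [ ]:

import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 2.0, 3.0
m, trials = 100, 50_000

# Each row is one experiment of m IID samples; S_m is the row mean
x = rng.normal(mu, sigma, size=(trials, m))
S_m = x.mean(axis=1)

print(S_m.mean())        # ~ mu
print(S_m.var())         # ~ sigma**2 / m
print(sigma**2 / m)      # exact: 0.09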