Probability for Machine Learning
by Prof. Seungchul Lee, iSystems Design Lab, UNIST
http://isystems.unist.ac.kr/
Table of Contents
I. 1. Random Variable (= r.v.)
   I. Expectation = mean
II. 2. Random Vectors (multivariate R.V.)
   I. 2.1. Joint density probability
   II. 2.2. Marginal density probability
   III. 2.3. Conditional probability
III. 3. Bayes Rule
IV. 4. Linear Transformation of Random Variables
   I. 4.1. For single random variable
   II. 4.2. Sum of two random variables X and Y
   III. 4.3. Affine transformation of random vectors
1. Random Variable (= r.v.)
(Rough) Definition: a variable whose values occur according to a probability distribution
Probability that $x = a$:

$$P_X(x = a) \triangleq P(x = a)$$

1) $P(x = a) \ge 0$
2) $\sum_{\text{all}} P(x) = 1$

$$x : \begin{cases} \text{continuous r.v.} & \text{if } x \text{ is continuous} \\ \text{discrete r.v.} & \text{if } x \text{ is discrete} \end{cases}$$
Example

$x$ : die outcome

$$P(X = 1) = P(X = 2) = \cdots = P(X = 6) = \frac{1}{6}$$

Question

$y = x_1 + x_2$ : sum of two dice

$$P_Y(y = 5) = ?$$
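The exact answer is $P_Y(y = 5) = \frac{4}{36} = \frac{1}{9}$, since 4 of the 36 equally likely outcomes sum to 5. A minimal Monte Carlo check (a sketch, not part of the original notes; the sample size is arbitrary):

In [ ]:
import numpy as np

np.random.seed(0)
x1 = np.random.randint(1, 7, size=100000)   # first die
x2 = np.random.randint(1, 7, size=100000)   # second die
y = x1 + x2                                 # sum of two dice

print(np.mean(y == 5))                      # close to 4/36 = 0.1111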
Expectation = mean

$$E[x] = \begin{cases} \sum_x x\,P(x) & \text{discrete} \\ \int_x x\,P(x)\,dx & \text{continuous} \end{cases}$$

Sample mean

$$\hat{E}[x] = \frac{1}{m}\sum_x x \quad (\because \text{uniform distribution assumed})$$

Variance

$$\text{var}[x] = E\left[(x - E[x])^2\right] \; \text{: mean square deviation from the mean}$$
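Both estimators are easy to check numerically; a sketch (assuming die outcomes as the data, not part of the original notes):

In [ ]:
import numpy as np

np.random.seed(0)
x = np.random.randint(1, 7, size=100000)     # m die outcomes
m = len(x)

sample_mean = np.sum(x) / m                  # (1/m) * sum of samples
sample_var = np.mean((x - sample_mean)**2)   # E[(x - E[x])^2]

print(sample_mean)                           # close to 3.5
print(sample_var)                            # close to 35/12 = 2.9167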
2. Random Vectors (multivariate R.V.)

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad n \text{ random variables}$$

2.1. Joint density probability

Joint density probability models the probability of co-occurrence of many r.v.

$$P_{X_1, \cdots, X_n}(X_1 = x_1, \cdots, X_n = x_n)$$
2.2. Marginal density probability

$$P_{X_1}(X_1 = x_1), \; \cdots, \; P_{X_n}(X_n = x_n)$$

For two r.v.

$$P(X) = \sum_y P(X, Y = y)$$

$$P(Y) = \sum_x P(X = x, Y)$$

2.3. Conditional probability

Probability of one event when we know the outcome of the other

$$P_{X_1 \mid X_2}(X_1 = x_1 \mid X_2 = x_2) = \frac{P(X_1 = x_1, X_2 = x_2)}{P(X_2 = x_2)} \; \text{: conditional prob. of } x_1 \text{ given } x_2$$
Independent random variables

when one tells nothing about the other

$$P(X_1 = x_1 \mid X_2 = x_2) = P(X_1 = x_1)$$

$$\Updownarrow$$

$$P(X_2 = x_2 \mid X_1 = x_1) = P(X_2 = x_2)$$

$$\Updownarrow$$

$$P(X_1 = x_1, X_2 = x_2) = P(X_1 = x_1)\,P(X_2 = x_2)$$

Example

four dice $\omega_1, \omega_2, \omega_3, \omega_4$

$x = \omega_1 + \omega_2$ : sum of the first two dice

$y = \omega_1 + \omega_2 + \omega_3 + \omega_4$ : sum of all four dice

probability of $\begin{bmatrix} x \\ y \end{bmatrix} = ?$

marginal probability

$$P_X(x) = \sum_y P_{XY}(x, y)$$

conditional probability: suppose we measured $y = 19$

$$P_{X \mid Y}(x \mid y = 19) = ?$$
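Both quantities can be estimated by simulating the four dice; a sketch (the sample size and the queried value $x = 10$ are arbitrary choices, not from the original notes):

In [ ]:
import numpy as np

np.random.seed(0)
w = np.random.randint(1, 7, size=(100000, 4))   # four dice per experiment
x = w[:, 0] + w[:, 1]                           # sum of the first two dice
y = w.sum(axis=1)                               # sum of all four dice

# marginal probability P_X(x): sum the joint over y (here: ignore y)
print(np.mean(x == 7))                          # P_X(x = 7), close to 6/36

# conditional probability P_{X|Y}(x | y = 19): restrict to samples with y = 19
print(np.mean(x[y == 19] == 10))                # P(x = 10 | y = 19)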
Pictorial Explanation
Example
Suppose we have three bins, labeled A, B, and C. Two of the bins have only white balls, and one bin has only black balls.

1) We take one ball, what is the probability that it is white? (white = 1)

2) When a white ball has been drawn from bin C, what is the probability of drawing a white ball from bin B?

3) When two balls have been drawn from two different bins, what is the probability of drawing two white balls?
Answers:

$$P(X_1 = 1) = \frac{2}{3}$$

$$P(X_2 = 1 \mid X_1 = 1) = \frac{1}{2}$$

$$P(X_1 = 1, X_2 = 1) = P(X_2 = 1 \mid X_1 = 1)\,P(X_1 = 1) = \frac{1}{2} \cdot \frac{2}{3} = \frac{1}{3}$$

3. Bayes Rule

enables us to swap $X_1$ and $X_2$ in conditional probability

$$P(X_2, X_1) = P(X_2 \mid X_1)\,P(X_1) = P(X_1 \mid X_2)\,P(X_2)$$

$$\therefore \; P(X_2 \mid X_1) = \frac{P(X_1 \mid X_2)\,P(X_2)}{P(X_1)}$$
Example
Suppose that in a group of people, 40% are male and 60% are female. 50% of the males are smokers, and 30% of the females are smokers. Find the probability that a smoker is male.

Bayes' Rule + conditional probability
$x$ = M or F, $y$ = S or N

$$P(x = M) = 0.4, \quad P(x = F) = 0.6$$

$$P(y = S \mid x = M) = 0.5, \quad P(y = S \mid x = F) = 0.3$$

$$P(x = M \mid y = S) = ?$$

$$P(y = S) = P(y = S \mid x = M)\,P(x = M) + P(y = S \mid x = F)\,P(x = F) = 0.5 \times 0.4 + 0.3 \times 0.6 = 0.38$$

$$P(x = M \mid y = S) = \frac{P(y = S \mid x = M)\,P(x = M)}{P(y = S)} = \frac{0.20}{0.38} \approx 0.53$$
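The same computation in code (a sketch mirroring the numbers above):

In [ ]:
# Bayes' rule: P(x = M | y = S) = P(y = S | x = M) P(x = M) / P(y = S)
P_M, P_F = 0.4, 0.6
P_S_given_M, P_S_given_F = 0.5, 0.3

P_S = P_S_given_M*P_M + P_S_given_F*P_F   # total probability: 0.38
P_M_given_S = P_S_given_M*P_M / P_S       # 0.20 / 0.38

print(P_S, P_M_given_S)                   # 0.38, approximately 0.53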
Example (History)

$$\left(\frac{9}{10}\right)^{20} \approx 0.1216$$

4. Linear Transformation of Random Variables
4.1. For single random variable

$$X \mapsto Y = aX$$

$$E[aX] = aE[X]$$

$$\text{var}(aX) = a^2\,\text{var}(X)$$

$$\begin{aligned}
\text{var}(X) &= E\big[(X - E[X])^2\big] = E\big[(X - \bar{X})^2\big] = E\big[X^2 - 2X\bar{X} + \bar{X}^2\big] \\
&= E[X^2] - 2E[X]\bar{X} + \bar{X}^2 \\
&= E[X^2] - E[X]^2
\end{aligned}$$
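Both scaling rules can be verified numerically; a sketch (the distribution and the value of $a$ are arbitrary choices, not from the original notes):

In [ ]:
import numpy as np

np.random.seed(0)
X = 1 + 2*np.random.randn(100000)       # some r.v. with mean 1, std 2
a = 3.0

print(np.mean(a*X), a*np.mean(X))       # E[aX] = a E[X]
print(np.var(a*X), a**2*np.var(X))      # var(aX) = a^2 var(X)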
4.2. Sum of two random variables X and Y

Note: quality control in manufacturing process

$$Z = X + Y \quad \text{(still univariate)}$$

$$E[X + Y] = E[X] + E[Y]$$

$$\begin{aligned}
\text{var}(X + Y) &= E\big[(X + Y - E[X + Y])^2\big] = E\big[\big((X - \bar{X}) + (Y - \bar{Y})\big)^2\big] \\
&= E\big[(X - \bar{X})^2\big] + E\big[(Y - \bar{Y})^2\big] + 2E\big[(X - \bar{X})(Y - \bar{Y})\big] \\
&= \text{var}(X) + \text{var}(Y) + 2\,\text{cov}(X, Y)
\end{aligned}$$

where

$$\begin{aligned}
\text{cov}(X, Y) &= E\big[(X - \bar{X})(Y - \bar{Y})\big] = E\big[XY - X\bar{Y} - Y\bar{X} + \bar{X}\bar{Y}\big] \\
&= E[XY] - E[X]\bar{Y} - E[Y]\bar{X} + \bar{X}\bar{Y} = E[XY] - E[X]E[Y]
\end{aligned}$$

$$\therefore \; \text{var}(X + Y) = \text{var}(X) + \text{var}(Y) + 2\,\text{cov}(X, Y)$$
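A numerical check of this identity with correlated samples (a sketch; the joint distribution is an arbitrary choice):

In [ ]:
import numpy as np

np.random.seed(0)
X = np.random.randn(100000)
Y = 0.5*X + np.random.randn(100000)       # Y is correlated with X

lhs = np.var(X + Y)
rhs = np.var(X) + np.var(Y) + 2*np.cov(X, Y)[0, 1]
print(lhs, rhs)                           # the two sides agree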
Remark

variance - for univariable
covariance - for bivariable

Covariance of two r.v.

$$\text{cov}(x, y) = E[(x - \mu_x)(y - \mu_y)]$$

Covariance matrix for random vectors

$$\text{cov}(X) = E\big[(X - \mu)(X - \mu)^T\big] = \begin{bmatrix} \text{cov}(X_1, X_1) & \text{cov}(X_1, X_2) \\ \text{cov}(X_2, X_1) & \text{cov}(X_2, X_2) \end{bmatrix} = \begin{bmatrix} \text{var}(X_1) & \text{cov}(X_1, X_2) \\ \text{cov}(X_2, X_1) & \text{var}(X_2) \end{bmatrix}$$

Moments: provide rough clues on probability distribution

$$\int x^k P_x(x)\,dx \quad \text{or} \quad \sum x^k P_x(x)$$
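NumPy's np.cov estimates this matrix directly from samples, one row per random variable (an illustrative sketch, not from the original notes):

In [ ]:
import numpy as np

np.random.seed(0)
X1 = np.random.randn(100000)
X2 = X1 + np.random.randn(100000)   # correlated with X1

X = np.vstack([X1, X2])             # each row is one random variable
print(np.cov(X))                    # [[var(X1), cov(X1,X2)], [cov(X2,X1), var(X2)]]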
4.3. Affine transformation of random vectors

$$y = Ax + b$$

1. $E[y] = AE[x] + b$
2. $\text{cov}(y) = A\,\text{cov}(x)\,A^T$
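Both properties can be verified with samples; a sketch (the choices of $A$ and $b$ are arbitrary):

In [ ]:
import numpy as np

np.random.seed(0)
x = np.random.randn(2, 100000)          # 2-dim random vector, one column per sample
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
b = np.array([[1.0],
              [-1.0]])

y = A @ x + b                           # affine transformation

print(y.mean(axis=1))                   # close to A E[x] + b = b here (E[x] = 0)
print(np.cov(y))                        # close to A cov(x) A^T
print(A @ np.cov(x) @ A.T)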
IID random variables

$$\text{IID} : \begin{cases} \text{independent} \\ \text{identically distributed} \end{cases}$$

Suppose $x_1, x_2, \cdots, x_m$ are IID with mean $\mu$ and variance $\sigma^2$.

Let $x = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}$, then $E[x] = \begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix}$, $\quad \text{cov}(x) = \begin{bmatrix} \sigma^2 & & \\ & \ddots & \\ & & \sigma^2 \end{bmatrix} = \sigma^2 I$
Sum of IID random variables ($\to$ single r.v.)
$$S_m = \frac{1}{m}\sum_{i=1}^{m} x_i \implies S_m = Ax \;\text{ where } A = \frac{1}{m}\begin{bmatrix} 1 & \cdots & 1 \end{bmatrix}$$

$$E[S_m] = AE[x] = \frac{1}{m}\begin{bmatrix} 1 & \cdots & 1 \end{bmatrix}\begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix} = \frac{1}{m}\,m\mu = \mu$$

$$\text{var}(S_m) = A\,\text{cov}(x)\,A^T = A\begin{bmatrix} \sigma^2 & & \\ & \ddots & \\ & & \sigma^2 \end{bmatrix}A^T = \frac{\sigma^2}{m}$$

Reduce the variance by a factor of $m$ $\implies$ Law of large numbers or Central limit theorem

$$\bar{x} \longrightarrow N\left(\mu, \left(\frac{\sigma}{\sqrt{m}}\right)^2\right)$$
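Simulating many sample means shows both effects: $\text{var}(S_m)$ shrinks like $\sigma^2/m$, and the distribution of $S_m$ looks Gaussian (a sketch; dice are one example of IID draws):

In [ ]:
import numpy as np

np.random.seed(0)
m = 100
x = np.random.randint(1, 7, size=(10000, m))   # 10000 experiments, m IID dice each
Sm = x.mean(axis=1)                            # sample mean of each experiment

print(Sm.mean())   # close to mu = 3.5
print(Sm.var())    # close to sigma^2/m = (35/12)/100 = 0.0292
# a histogram of Sm is approximately N(mu, sigma^2/m), as the CLT predicts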