Discrete Probability Distributions - University at Buffalo, Machine Learning, Srihari (slide transcript)
Machine Learning Srihari
Discrete Probability Distributions Sargur N. Srihari
Binary Variables
Bernoulli, Binomial and Beta
Bernoulli Distribution
• Expresses the distribution of a single binary-valued random variable x ∈ {0,1}
• Probability of x=1 is denoted by the parameter µ, i.e.,
  p(x=1|µ) = µ
• Therefore
  p(x=0|µ) = 1-µ
• The probability distribution has the form
  Bern(x|µ) = µ^x (1-µ)^(1-x)
• Mean is shown to be E[x] = µ
• Variance is var[x] = µ(1-µ)
• Likelihood of N observations independently drawn from p(x|µ) is
  p(D|µ) = ∏_{n=1}^{N} p(x_n|µ) = ∏_{n=1}^{N} µ^{x_n} (1-µ)^{1-x_n}
• Log-likelihood is
  ln p(D|µ) = ∑_{n=1}^{N} ln p(x_n|µ) = ∑_{n=1}^{N} { x_n ln µ + (1-x_n) ln(1-µ) }
• Maximum likelihood estimator, obtained by setting the derivative of ln p(D|µ) wrt µ equal to zero, is
  µ_ML = (1/N) ∑_{n=1}^{N} x_n
• If the number of observations with x=1 is m, then µ_ML = m/N
Jacob Bernoulli 1654-1705
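The MLE above is just the fraction of observations equal to 1. The slides provide MATLAB snippets later; here is a minimal illustrative Python sketch (not from the original slides):

```python
# Maximum-likelihood estimation for a Bernoulli parameter:
# given N binary draws, mu_ML = m / N, the fraction of 1s.

def bernoulli_mle(observations):
    """Return mu_ML = m/N, the fraction of observations with x=1."""
    return sum(observations) / len(observations)

# Example: 3 heads (x=1) out of 5 tosses -> mu_ML = 0.6
data = [1, 0, 1, 1, 0]
mu_ml = bernoulli_mle(data)
```

With only a handful of tosses this estimate over-fits badly (e.g. three heads in a row gives µ_ML = 1), which is exactly the motivation for the Bayesian treatment with the Beta prior later in the slides.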
Binomial Distribution
• Related to the Bernoulli distribution
• Expresses the distribution of m
  – the number of observations for which x=1
• A given sequence with m heads is proportional to Bern(x|µ); add up all ways of obtaining m heads
• Mean and variance are
  Bin(m|N,µ) = (N choose m) µ^m (1-µ)^(N-m)

  E[m] = ∑_{m=0}^{N} m Bin(m|N,µ) = Nµ
  var[m] = Nµ(1-µ)
Histogram of Binomial for N=10 and µ=0.25
Binomial coefficient:
  (N choose m) = N! / (m! (N-m)!)
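The binomial pmf can be evaluated directly from the coefficient formula. A short Python sketch (illustrative, not from the slides, which use MATLAB's binopdf):

```python
# Binomial pmf: Bin(m|N,mu) = C(N,m) * mu^m * (1-mu)^(N-m),
# with the coefficient C(N,m) computed by math.comb.
from math import comb

def binom_pmf(m, N, mu):
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

# Sanity checks against the stated properties: the pmf sums to 1
# over m = 0..N, and its mean is N*mu (here 10 * 0.25 = 2.5).
N, mu = 10, 0.25
total = sum(binom_pmf(m, N, mu) for m in range(N + 1))
mean = sum(m * binom_pmf(m, N, mu) for m in range(N + 1))
```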
Beta Distribution
• Beta distribution
  Beta(µ|a,b) = Γ(a+b)/(Γ(a)Γ(b)) µ^(a-1) (1-µ)^(b-1)
• where the Gamma function is defined as
  Γ(x) = ∫_0^∞ u^(x-1) e^(-u) du
• a and b are hyperparameters that control the distribution of the parameter µ
• Mean and variance
  E[µ] = a/(a+b)
  var[µ] = ab / ((a+b)^2 (a+b+1))

Plots of the Beta distribution as a function of µ for hyperparameter settings (a=0.1, b=0.1), (a=1, b=1), (a=2, b=3), (a=8, b=4)
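The density and its mean can be checked numerically. An illustrative Python sketch (an assumption of this transcript, not from the slides), using math.gamma for Γ:

```python
# Beta density via the Gamma function, checked against the stated
# mean E[mu] = a/(a+b) by a simple midpoint-rule integration on (0,1).
from math import gamma

def beta_pdf(mu, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu**(a - 1) * (1 - mu)**(b - 1)

a, b, n = 2.0, 3.0, 10000
xs = [(i + 0.5) / n for i in range(n)]            # midpoints of n subintervals
total = sum(beta_pdf(x, a, b) for x in xs) / n    # should be close to 1
mean = sum(x * beta_pdf(x, a, b) for x in xs) / n # should be close to a/(a+b) = 0.4
```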
Bayesian Inference with Beta
• MLE of µ in Bernoulli is the fraction of observations with x=1
  – Severely over-fitted for small data sets
• Likelihood function takes products of factors of the form µ^x (1-µ)^(1-x)
• If the prior distribution of µ is chosen to be proportional to powers of µ and (1-µ), the posterior will have the same functional form as the prior
  – Called conjugacy
• Beta has a form suitable for a prior distribution p(µ)
Bayesian Inference with Beta
• Posterior obtained by multiplying the beta prior with the binomial likelihood yields
  p(µ|m,l,a,b) ∝ µ^(m+a-1) (1-µ)^(l+b-1)
  – where m is the number of heads and l=N-m is the number of tails
• It is another beta distribution
  p(µ|m,l,a,b) = Γ(m+a+l+b)/(Γ(m+a)Γ(l+b)) µ^(m+a-1) (1-µ)^(l+b-1)
  – Effectively increases the value of a by m and b by l
  – As the number of observations increases, the distribution becomes more peaked
Illustration of one step in the process: prior p(µ) = Beta(µ|a=2,b=2); likelihood p(x=1|µ) = µ^1(1-µ)^0 for a single observation N=m=1 with x=1; posterior p(µ|x=1) = Beta(µ|a=3,b=2)
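The conjugate update is just count bookkeeping, which a short Python sketch (illustrative, not from the slides) makes concrete:

```python
# Conjugate Beta-Bernoulli update: each observation with x=1 (a head)
# adds 1 to a, each observation with x=0 (a tail) adds 1 to b.

def update_beta(a, b, observations):
    m = sum(observations)        # number of heads
    l = len(observations) - m    # number of tails, l = N - m
    return a + m, b + l

# One step matching the illustration: prior Beta(2,2), a single
# observation x=1 gives posterior Beta(3,2)
a_post, b_post = update_beta(2, 2, [1])
```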
Predicting the Next Trial Outcome
• Need the predictive distribution of x given observed D
  – From the sum and product rules
  p(x=1|D) = ∫_0^1 p(x=1,µ|D) dµ = ∫_0^1 p(x=1|µ) p(µ|D) dµ = ∫_0^1 µ p(µ|D) dµ = E[µ|D]
• Expected value of the posterior distribution can be shown to be
  p(x=1|D) = (m+a) / (m+a+l+b)
  – which is the fraction of observations (both fictitious and real) that correspond to x=1
• Maximum likelihood and Bayesian results agree in the limit of infinite observations
  – On average, uncertainty (variance) decreases with observed data
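The predictive formula above can be sketched in Python (illustrative, not from the slides):

```python
# Bayesian predictive probability for the next Bernoulli trial:
# p(x=1|D) = (m+a)/(m+a+l+b), where m heads and l tails were observed
# under a Beta(a,b) prior. The prior acts as a+b fictitious observations.

def predictive(m, l, a, b):
    return (m + a) / (m + a + l + b)

# With m=2 heads, l=1 tail, and a Beta(2,2) prior: (2+2)/(2+2+1+2) = 4/7
p = predictive(m=2, l=1, a=2, b=2)
```

As m and l grow large relative to a and b, this approaches the MLE m/(m+l), matching the stated agreement in the limit of infinite observations.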
Summary of Binary Distributions
• A single binary variable's distribution is represented by the Bernoulli
• Binomial is related to Bernoulli
  – Expresses the distribution of the number of occurrences of x=1 in N trials
• Beta distribution is a conjugate prior for the Bernoulli
  – Both have the same functional form
Sample Matlab Code
Probability Distributions
• Binomial Distribution:
  – Probability Density Function: Y = binopdf(X,N,P) returns the binomial probability density function with parameters N and P at the values in X.
  – Random Number Generator: R = binornd(N,P,MM,NN) returns an MM-by-NN matrix of random numbers chosen from a binomial distribution with parameters N and P.
• Beta Distribution:
  – Probability Density Function: Y = betapdf(X,A,B) returns the beta probability density function with parameters A and B at the values in X.
  – Random Number Generator: R = betarnd(A,B) returns a matrix of random numbers chosen from the beta distribution with parameters A and B.
Multinomial Variables
Generalized Bernoulli and Dirichlet
Generalization of Binomial
• Binomial: tossing a coin
  – Expresses the probability of the number of successes in N trials
  • e.g., probability of 3 rainy days in 10 days
• Multinomial: throwing a die
  – Probability of a given frequency for each value
  • e.g., probability of 3 specific letters in a string of N
• Probability Calculator
  – http://stattrek.com/Tables/Multinomial.aspx
Histogram of Binomial for N=10 and µ=0.25
Generalization of Bernoulli
• Bernoulli distribution: x is 0 or 1
  Bern(x|µ) = µ^x (1-µ)^(1-x)
• Discrete variable that takes one of K values (instead of 2)
• Represent as a 1-of-K scheme
  – Represent x as a K-dimensional vector
  – If x=3 then (with K=6) we represent it as x=(0,0,1,0,0,0)^T
  – Such vectors satisfy ∑_{k=1}^{K} x_k = 1
• If the probability of x_k=1 is denoted µ_k, then the distribution of x is given by the generalized Bernoulli
  p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}   where µ = (µ_1,..,µ_K)^T
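The 1-of-K representation and the generalized Bernoulli can be sketched in Python (illustrative, not from the slides):

```python
# 1-of-K encoding and the generalized Bernoulli p(x|mu) = prod_k mu_k^{x_k}.
# Only the component with x_k = 1 contributes; the rest have exponent 0.

def one_hot(value, K):
    """Encode value in 1..K as a K-dimensional binary vector."""
    return [1 if k == value - 1 else 0 for k in range(K)]

def gen_bernoulli(x, mu):
    p = 1.0
    for xk, muk in zip(x, mu):
        p *= muk ** xk
    return p

mu = [0.1, 0.2, 0.4, 0.1, 0.1, 0.1]   # illustrative (loaded) die probabilities
x = one_hot(3, 6)                      # x=3 -> (0,0,1,0,0,0)
p = gen_bernoulli(x, mu)               # picks out mu_3 = 0.4
```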
MLE of Generalized Bernoulli Parameters
• Data set D of N independent observations x_1,..,x_N
  – where the nth observation is written as [x_n1,..,x_nK]
• Likelihood function has the form
  p(D|µ) = ∏_{n=1}^{N} ∏_{k=1}^{K} µ_k^{x_nk} = ∏_{k=1}^{K} µ_k^{∑_n x_nk} = ∏_{k=1}^{K} µ_k^{m_k}
  – where m_k = ∑_n x_nk is the number of observations with x_k=1
• Maximum likelihood solution (obtained by setting the derivative of the log-likelihood wrt µ_k to zero, subject to ∑_k µ_k = 1) is
  µ_k^ML = m_k / N
which is the fraction of the N observations for which x_k=1
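As a concrete Python sketch (illustrative, not from the slides), the MLE reduces to per-component counting:

```python
# MLE for generalized Bernoulli parameters: with counts m_k = sum_n x_nk,
# the maximum-likelihood solution is mu_k = m_k / N.

def multinomial_mle(data):
    """data: list of N 1-of-K binary vectors; returns [m_k / N for each k]."""
    N, K = len(data), len(data[0])
    m = [sum(x[k] for x in data) for k in range(K)]
    return [mk / N for mk in m]

# Four observations of a 3-state variable: counts m = (1, 2, 1)
data = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
mu_ml = multinomial_mle(data)   # [0.25, 0.5, 0.25]
```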
Generalized Binomial Distribution
• Multinomial distribution (with a K-state variable)
  Mult(m_1,m_2,..,m_K|µ,N) = (N choose m_1 m_2 .. m_K) ∏_{k=1}^{K} µ_k^{m_k}
• where the normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m_1,m_2,..,m_K
• Given by
  (N choose m_1 m_2 .. m_K) = N! / (m_1! m_2! .. m_K!)
• and the parameters satisfy ∑_{k=1}^{K} µ_k = 1
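The multinomial pmf follows directly from the coefficient formula. A Python sketch (illustrative, not from the slides):

```python
# Multinomial pmf: Mult(m_1..m_K | mu, N) = N!/(m_1!..m_K!) * prod_k mu_k^{m_k}
from math import factorial, prod

def multinomial_pmf(counts, mu):
    N = sum(counts)
    coeff = factorial(N) // prod(factorial(m) for m in counts)
    return coeff * prod(p**m for m, p in zip(counts, mu))

# Fair die, 6 throws, exactly one of each face:
# coefficient 6!/(1!)^6 = 720, so p = 720 * (1/6)^6
p = multinomial_pmf([1] * 6, [1 / 6] * 6)
```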
Dirichlet Distribution
• Family of prior distributions for the parameters µ_k of the multinomial distribution
• By inspection of the multinomial, the form of the conjugate prior is
  p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k - 1}   where 0 ≤ µ_k ≤ 1 and ∑_k µ_k = 1
• Normalized form of the Dirichlet distribution
  Dir(µ|α) = Γ(α_0)/(Γ(α_1)...Γ(α_K)) ∏_{k=1}^{K} µ_k^{α_k - 1}   where α_0 = ∑_{k=1}^{K} α_k

Lejeune Dirichlet 1805-1859
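The normalized density can be written down directly from the formula. A Python sketch (an illustrative assumption of this transcript, not from the slides):

```python
# Dirichlet density on the simplex:
# Dir(mu|alpha) = Gamma(a0) / prod_k Gamma(alpha_k) * prod_k mu_k^{alpha_k - 1}
from math import gamma

def dirichlet_pdf(mu, alpha):
    a0 = sum(alpha)
    norm = gamma(a0)
    for a in alpha:
        norm /= gamma(a)
    p = norm
    for m, a in zip(mu, alpha):
        p *= m ** (a - 1)
    return p

# With alpha_k = 1 for all k the Dirichlet is uniform on the simplex;
# for K=3 that uniform density equals Gamma(3) = 2 at every point
p = dirichlet_pdf([0.2, 0.3, 0.5], [1, 1, 1])
```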
Dirichlet over 3 Variables
• Due to the summation constraint ∑_k µ_k = 1, the distribution over the space of {µ_k} is confined to a simplex of dimensionality K-1
• For K=3
  Dir(µ|α) = Γ(α_0)/(Γ(α_1)...Γ(α_3)) ∏_{k=1}^{3} µ_k^{α_k - 1}   where α_0 = ∑_{k=1}^{3} α_k

Plots of the Dirichlet distribution over the simplex for parameter settings αk=0.1, αk=1, αk=10
Dirichlet Posterior Distribution
• Multiplying the prior by the likelihood
  p(µ|D,α) ∝ p(D|µ) p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k + m_k - 1}
• which has the form of the Dirichlet distribution
  p(µ|D,α) = Dir(µ|α+m) = Γ(α_0+N)/(Γ(α_1+m_1)...Γ(α_K+m_K)) ∏_{k=1}^{K} µ_k^{α_k + m_k - 1}
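As with the Beta-Bernoulli case, the conjugate update is pure count bookkeeping. A Python sketch (illustrative, not from the slides):

```python
# Conjugate Dirichlet-multinomial update: the posterior is
# Dir(mu | alpha + m), i.e. each observed count m_k is added to alpha_k.

def dirichlet_posterior(alpha, counts):
    return [a + m for a, m in zip(alpha, counts)]

# Prior Dir(1,1,1) plus observed counts m = (2, 0, 3)
# gives posterior Dir(3, 1, 4)
post = dirichlet_posterior([1, 1, 1], [2, 0, 3])
```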
Summary of Discrete Distributions
• Bernoulli (2 states): Bern(x|µ) = µ^x (1-µ)^(1-x)
  – Binomial: Bin(m|N,µ) = (N choose m) µ^m (1-µ)^(N-m), where (N choose m) = N! / (m! (N-m)!)
• Generalized Bernoulli (K states): p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}, where µ = (µ_1,..,µ_K)^T
  – Multinomial: Mult(m_1,..,m_K|µ,N) = (N choose m_1 .. m_K) ∏_{k=1}^{K} µ_k^{m_k}
• Conjugate priors:
  – For the Binomial: Beta(µ|a,b) = Γ(a+b)/(Γ(a)Γ(b)) µ^(a-1) (1-µ)^(b-1)
  – For the Multinomial: Dir(µ|α) = Γ(α_0)/(Γ(α_1)...Γ(α_K)) ∏_{k=1}^{K} µ_k^{α_k - 1}, where α_0 = ∑_k α_k
Distributions: Landscape
• Discrete, binary: Bernoulli, Binomial, Beta
• Discrete, multivalued: Multinomial, Dirichlet
• Continuous: Gaussian, Gamma, Wishart, Student's-t, Exponential, Uniform
• Angular: Von Mises
Distributions: Relationships
• Discrete, binary
  – Bernoulli: single binary variable
  – Binomial: N samples of a Bernoulli (N=1 recovers the Bernoulli)
  – Beta: continuous variable in [0,1]; conjugate prior of the Bernoulli/Binomial
• Discrete, multi-valued
  – Multinomial: one of K values = K-dimensional binary vector (K=2 recovers the binary case)
  – Dirichlet: K random variables in [0,1]; conjugate prior of the Multinomial
• Continuous
  – Gaussian (limit of the Binomial for large N)
  – Gamma: conjugate prior of univariate Gaussian precision
  – Wishart: conjugate prior of multivariate Gaussian precision matrix
  – Student's-t: generalization of Gaussian robust to outliers; infinite mixture of Gaussians
  – Exponential: special case of Gamma
  – Uniform
  – Gaussian-Gamma: conjugate prior of univariate Gaussian with unknown mean and precision
  – Gaussian-Wishart: conjugate prior of multivariate Gaussian with unknown mean and precision matrix
• Angular
  – Von Mises