Discrete Probability Distributions - University at Buffalo, Machine Learning, Srihari (slide transcript)
Machine Learning Srihari
Discrete Probability Distributions Sargur N. Srihari
Binary Variables
Bernoulli, Binomial and Beta
Bernoulli Distribution
• Expresses the distribution of a single binary-valued random variable x ∈ {0,1}
• Probability of x=1 is denoted by the parameter µ, i.e.,
  p(x=1|µ) = µ
• Therefore
  p(x=0|µ) = 1-µ
• The probability distribution has the form
  Bern(x|µ) = µ^x (1-µ)^(1-x)
• Mean is shown to be E[x] = µ
• Variance is var[x] = µ(1-µ)
• Likelihood of N observations independently drawn from p(x|µ) is
  p(D|µ) = ∏_{n=1}^{N} p(x_n|µ) = ∏_{n=1}^{N} µ^{x_n} (1-µ)^{1-x_n}
• Log-likelihood is
  ln p(D|µ) = ∑_{n=1}^{N} ln p(x_n|µ) = ∑_{n=1}^{N} { x_n ln µ + (1-x_n) ln(1-µ) }
• Maximum likelihood estimator, obtained by setting the derivative of ln p(D|µ) wrt µ equal to zero, is
  µ_ML = (1/N) ∑_{n=1}^{N} x_n
• If the number of observations with x=1 is m, then µ_ML = m/N
Jacob Bernoulli 1654-1705
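The MLE above is just the fraction of observations equal to 1. The slides provide MATLAB snippets later; here is a minimal illustrative Python sketch (not from the original slides):

```python
# Maximum-likelihood estimation for a Bernoulli parameter:
# given N binary draws, mu_ML = m / N, the fraction of 1s.

def bernoulli_mle(observations):
    """Return mu_ML = m/N, the fraction of observations with x=1."""
    return sum(observations) / len(observations)

# Example: 3 heads (x=1) out of 5 tosses -> mu_ML = 0.6
data = [1, 0, 1, 1, 0]
mu_ml = bernoulli_mle(data)
```

With only a handful of tosses this estimate over-fits badly (e.g. three heads in a row gives µ_ML = 1), which is exactly the motivation for the Bayesian treatment with the Beta prior later in the slides.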
Binomial Distribution
• Related to the Bernoulli distribution
• Expresses the distribution of m
  – the number of observations for which x=1
• A given sequence with m heads is proportional to Bern(x|µ); add up all ways of obtaining m heads
• Mean and variance are
  Bin(m|N,µ) = (N choose m) µ^m (1-µ)^(N-m)

  E[m] = ∑_{m=0}^{N} m Bin(m|N,µ) = Nµ
  var[m] = Nµ(1-µ)
Histogram of Binomial for N=10 and µ=0.25
Binomial coefficient:
  (N choose m) = N! / (m! (N-m)!)
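The binomial pmf can be evaluated directly from the coefficient formula. A short Python sketch (illustrative, not from the slides, which use MATLAB's binopdf):

```python
# Binomial pmf: Bin(m|N,mu) = C(N,m) * mu^m * (1-mu)^(N-m),
# with the coefficient C(N,m) computed by math.comb.
from math import comb

def binom_pmf(m, N, mu):
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

# Sanity checks against the stated properties: the pmf sums to 1
# over m = 0..N, and its mean is N*mu (here 10 * 0.25 = 2.5).
N, mu = 10, 0.25
total = sum(binom_pmf(m, N, mu) for m in range(N + 1))
mean = sum(m * binom_pmf(m, N, mu) for m in range(N + 1))
```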
Beta Distribution
• Beta distribution
  Beta(µ|a,b) = Γ(a+b)/(Γ(a)Γ(b)) µ^(a-1) (1-µ)^(b-1)
• where the Gamma function is defined as
  Γ(x) = ∫_0^∞ u^(x-1) e^(-u) du
• a and b are hyperparameters that control the distribution of the parameter µ
• Mean and variance
  E[µ] = a/(a+b)
  var[µ] = ab / ((a+b)^2 (a+b+1))

Plots of the Beta distribution as a function of µ for hyperparameter settings (a=0.1, b=0.1), (a=1, b=1), (a=2, b=3), (a=8, b=4)
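The density and its mean can be checked numerically. An illustrative Python sketch (an assumption of this transcript, not from the slides), using math.gamma for Γ:

```python
# Beta density via the Gamma function, checked against the stated
# mean E[mu] = a/(a+b) by a simple midpoint-rule integration on (0,1).
from math import gamma

def beta_pdf(mu, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu**(a - 1) * (1 - mu)**(b - 1)

a, b, n = 2.0, 3.0, 10000
xs = [(i + 0.5) / n for i in range(n)]            # midpoints of n subintervals
total = sum(beta_pdf(x, a, b) for x in xs) / n    # should be close to 1
mean = sum(x * beta_pdf(x, a, b) for x in xs) / n # should be close to a/(a+b) = 0.4
```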
Bayesian Inference with Beta
• MLE of µ in Bernoulli is the fraction of observations with x=1
  – Severely over-fitted for small data sets
• Likelihood function takes products of factors of the form µ^x (1-µ)^(1-x)
• If the prior distribution of µ is chosen to be proportional to powers of µ and (1-µ), the posterior will have the same functional form as the prior
  – Called conjugacy
• Beta has a form suitable for a prior distribution p(µ)
Bayesian Inference with Beta
• Posterior obtained by multiplying the beta prior with the binomial likelihood yields
  p(µ|m,l,a,b) ∝ µ^(m+a-1) (1-µ)^(l+b-1)
  – where m is the number of heads and l=N-m is the number of tails
• It is another beta distribution
  p(µ|m,l,a,b) = Γ(m+a+l+b)/(Γ(m+a)Γ(l+b)) µ^(m+a-1) (1-µ)^(l+b-1)
  – Effectively increases the value of a by m and b by l
  – As the number of observations increases, the distribution becomes more peaked
Illustration of one step in the process: prior p(µ) = Beta(µ|a=2,b=2); likelihood p(x=1|µ) = µ^1(1-µ)^0 for a single observation N=m=1 with x=1; posterior p(µ|x=1) = Beta(µ|a=3,b=2)
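The conjugate update is just count bookkeeping, which a short Python sketch (illustrative, not from the slides) makes concrete:

```python
# Conjugate Beta-Bernoulli update: each observation with x=1 (a head)
# adds 1 to a, each observation with x=0 (a tail) adds 1 to b.

def update_beta(a, b, observations):
    m = sum(observations)        # number of heads
    l = len(observations) - m    # number of tails, l = N - m
    return a + m, b + l

# One step matching the illustration: prior Beta(2,2), a single
# observation x=1 gives posterior Beta(3,2)
a_post, b_post = update_beta(2, 2, [1])
```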
Predicting the Next Trial Outcome
• Need the predictive distribution of x given observed D
  – From the sum and product rules
  p(x=1|D) = ∫_0^1 p(x=1,µ|D) dµ = ∫_0^1 p(x=1|µ) p(µ|D) dµ = ∫_0^1 µ p(µ|D) dµ = E[µ|D]
• Expected value of the posterior distribution can be shown to be
  p(x=1|D) = (m+a) / (m+a+l+b)
  – which is the fraction of observations (both fictitious and real) that correspond to x=1
• Maximum likelihood and Bayesian results agree in the limit of infinite observations
  – On average, uncertainty (variance) decreases with observed data
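The predictive formula above can be sketched in Python (illustrative, not from the slides):

```python
# Bayesian predictive probability for the next Bernoulli trial:
# p(x=1|D) = (m+a)/(m+a+l+b), where m heads and l tails were observed
# under a Beta(a,b) prior. The prior acts as a+b fictitious observations.

def predictive(m, l, a, b):
    return (m + a) / (m + a + l + b)

# With m=2 heads, l=1 tail, and a Beta(2,2) prior: (2+2)/(2+2+1+2) = 4/7
p = predictive(m=2, l=1, a=2, b=2)
```

As m and l grow large relative to a and b, this approaches the MLE m/(m+l), matching the stated agreement in the limit of infinite observations.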
Summary of Binary Distributions
• A single binary variable's distribution is represented by the Bernoulli
• Binomial is related to Bernoulli
  – Expresses the distribution of the number of occurrences of x=1 in N trials
• Beta distribution is a conjugate prior for the Bernoulli
  – Both have the same functional form
Sample Matlab Code
Probability Distributions
• Binomial Distribution:
  – Probability Density Function: Y = binopdf(X,N,P) returns the binomial probability density function with parameters N and P at the values in X.
  – Random Number Generator: R = binornd(N,P,MM,NN) returns an MM-by-NN matrix of random numbers chosen from a binomial distribution with parameters N and P.
• Beta Distribution:
  – Probability Density Function: Y = betapdf(X,A,B) returns the beta probability density function with parameters A and B at the values in X.
  – Random Number Generator: R = betarnd(A,B) returns a matrix of random numbers chosen from the beta distribution with parameters A and B.
Multinomial Variables
Generalized Bernoulli and Dirichlet
Generalization of Binomial
• Binomial: tossing a coin
  – Expresses the probability of the number of successes in N trials
  • e.g., probability of 3 rainy days in 10 days
• Multinomial: throwing a die
  – Probability of a given frequency for each value
  • e.g., probability of 3 specific letters in a string of N
• Probability Calculator
  – http://stattrek.com/Tables/Multinomial.aspx
Histogram of Binomial for N=10 and µ=0.25
Generalization of Bernoulli
• Bernoulli distribution: x is 0 or 1
  Bern(x|µ) = µ^x (1-µ)^(1-x)
• Discrete variable that takes one of K values (instead of 2)
• Represent as a 1-of-K scheme
  – Represent x as a K-dimensional vector
  – If x=3 then (with K=6) we represent it as x=(0,0,1,0,0,0)^T
  – Such vectors satisfy ∑_{k=1}^{K} x_k = 1
• If the probability of x_k=1 is denoted µ_k, then the distribution of x is given by the generalized Bernoulli
  p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}   where µ = (µ_1,..,µ_K)^T
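The 1-of-K representation and the generalized Bernoulli can be sketched in Python (illustrative, not from the slides):

```python
# 1-of-K encoding and the generalized Bernoulli p(x|mu) = prod_k mu_k^{x_k}.
# Only the component with x_k = 1 contributes; the rest have exponent 0.

def one_hot(value, K):
    """Encode value in 1..K as a K-dimensional binary vector."""
    return [1 if k == value - 1 else 0 for k in range(K)]

def gen_bernoulli(x, mu):
    p = 1.0
    for xk, muk in zip(x, mu):
        p *= muk ** xk
    return p

mu = [0.1, 0.2, 0.4, 0.1, 0.1, 0.1]   # illustrative (loaded) die probabilities
x = one_hot(3, 6)                      # x=3 -> (0,0,1,0,0,0)
p = gen_bernoulli(x, mu)               # picks out mu_3 = 0.4
```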
MLE of Generalized Bernoulli Parameters
• Data set D of N independent observations x_1,..,x_N
  – where the nth observation is written as [x_n1,..,x_nK]
• Likelihood function has the form
  p(D|µ) = ∏_{n=1}^{N} ∏_{k=1}^{K} µ_k^{x_nk} = ∏_{k=1}^{K} µ_k^{∑_n x_nk} = ∏_{k=1}^{K} µ_k^{m_k}
  – where m_k = ∑_n x_nk is the number of observations with x_k=1
• Maximum likelihood solution (obtained by setting the derivative of the log-likelihood wrt µ_k to zero, subject to ∑_k µ_k = 1) is
  µ_k^ML = m_k / N
which is the fraction of the N observations for which x_k=1
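As a concrete Python sketch (illustrative, not from the slides), the MLE reduces to per-component counting:

```python
# MLE for generalized Bernoulli parameters: with counts m_k = sum_n x_nk,
# the maximum-likelihood solution is mu_k = m_k / N.

def multinomial_mle(data):
    """data: list of N 1-of-K binary vectors; returns [m_k / N for each k]."""
    N, K = len(data), len(data[0])
    m = [sum(x[k] for x in data) for k in range(K)]
    return [mk / N for mk in m]

# Four observations of a 3-state variable: counts m = (1, 2, 1)
data = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
mu_ml = multinomial_mle(data)   # [0.25, 0.5, 0.25]
```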
Generalized Binomial Distribution
• Multinomial distribution (with a K-state variable)
  Mult(m_1,m_2,..,m_K|µ,N) = (N choose m_1 m_2 .. m_K) ∏_{k=1}^{K} µ_k^{m_k}
• where the normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m_1,m_2,..,m_K
• Given by
  (N choose m_1 m_2 .. m_K) = N! / (m_1! m_2! .. m_K!)
• and the parameters satisfy ∑_{k=1}^{K} µ_k = 1
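The multinomial pmf follows directly from the coefficient formula. A Python sketch (illustrative, not from the slides):

```python
# Multinomial pmf: Mult(m_1..m_K | mu, N) = N!/(m_1!..m_K!) * prod_k mu_k^{m_k}
from math import factorial, prod

def multinomial_pmf(counts, mu):
    N = sum(counts)
    coeff = factorial(N) // prod(factorial(m) for m in counts)
    return coeff * prod(p**m for m, p in zip(counts, mu))

# Fair die, 6 throws, exactly one of each face:
# coefficient 6!/(1!)^6 = 720, so p = 720 * (1/6)^6
p = multinomial_pmf([1] * 6, [1 / 6] * 6)
```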
Dirichlet Distribution
• Family of prior distributions for the parameters µ_k of the multinomial distribution
• By inspection of the multinomial, the form of the conjugate prior is
  p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k - 1}   where 0 ≤ µ_k ≤ 1 and ∑_k µ_k = 1
• Normalized form of the Dirichlet distribution
  Dir(µ|α) = Γ(α_0)/(Γ(α_1)...Γ(α_K)) ∏_{k=1}^{K} µ_k^{α_k - 1}   where α_0 = ∑_{k=1}^{K} α_k

Lejeune Dirichlet 1805-1859
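The normalized density can be written down directly from the formula. A Python sketch (an illustrative assumption of this transcript, not from the slides):

```python
# Dirichlet density on the simplex:
# Dir(mu|alpha) = Gamma(a0) / prod_k Gamma(alpha_k) * prod_k mu_k^{alpha_k - 1}
from math import gamma

def dirichlet_pdf(mu, alpha):
    a0 = sum(alpha)
    norm = gamma(a0)
    for a in alpha:
        norm /= gamma(a)
    p = norm
    for m, a in zip(mu, alpha):
        p *= m ** (a - 1)
    return p

# With alpha_k = 1 for all k the Dirichlet is uniform on the simplex;
# for K=3 that uniform density equals Gamma(3) = 2 at every point
p = dirichlet_pdf([0.2, 0.3, 0.5], [1, 1, 1])
```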
Dirichlet over 3 Variables
• Due to the summation constraint ∑_k µ_k = 1, the distribution over the space of {µ_k} is confined to a simplex of dimensionality K-1
• For K=3
  Dir(µ|α) = Γ(α_0)/(Γ(α_1)...Γ(α_3)) ∏_{k=1}^{3} µ_k^{α_k - 1}   where α_0 = ∑_{k=1}^{3} α_k

Plots of the Dirichlet distribution over the simplex for parameter settings αk=0.1, αk=1, αk=10
Dirichlet Posterior Distribution
• Multiplying the prior by the likelihood
  p(µ|D,α) ∝ p(D|µ) p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k + m_k - 1}
• which has the form of the Dirichlet distribution
  p(µ|D,α) = Dir(µ|α+m) = Γ(α_0+N)/(Γ(α_1+m_1)...Γ(α_K+m_K)) ∏_{k=1}^{K} µ_k^{α_k + m_k - 1}
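As with the Beta-Bernoulli case, the conjugate update is pure count bookkeeping. A Python sketch (illustrative, not from the slides):

```python
# Conjugate Dirichlet-multinomial update: the posterior is
# Dir(mu | alpha + m), i.e. each observed count m_k is added to alpha_k.

def dirichlet_posterior(alpha, counts):
    return [a + m for a, m in zip(alpha, counts)]

# Prior Dir(1,1,1) plus observed counts m = (2, 0, 3)
# gives posterior Dir(3, 1, 4)
post = dirichlet_posterior([1, 1, 1], [2, 0, 3])
```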
Summary of Discrete Distributions
• Bernoulli (2 states): Bern(x|µ) = µ^x (1-µ)^(1-x)
  – Binomial: Bin(m|N,µ) = (N choose m) µ^m (1-µ)^(N-m), where (N choose m) = N! / (m! (N-m)!)
• Generalized Bernoulli (K states): p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}, where µ = (µ_1,..,µ_K)^T
  – Multinomial: Mult(m_1,..,m_K|µ,N) = (N choose m_1 .. m_K) ∏_{k=1}^{K} µ_k^{m_k}
• Conjugate priors:
  – For the Binomial: Beta(µ|a,b) = Γ(a+b)/(Γ(a)Γ(b)) µ^(a-1) (1-µ)^(b-1)
  – For the Multinomial: Dir(µ|α) = Γ(α_0)/(Γ(α_1)...Γ(α_K)) ∏_{k=1}^{K} µ_k^{α_k - 1}, where α_0 = ∑_k α_k
Distributions: Landscape
• Discrete, binary: Bernoulli, Binomial, Beta
• Discrete, multivalued: Multinomial, Dirichlet
• Continuous: Gaussian, Gamma, Wishart, Student's-t, Exponential, Uniform
• Angular: Von Mises
Distributions: Relationships
• Discrete, binary
  – Bernoulli: single binary variable
  – Binomial: N samples of a Bernoulli (N=1 recovers the Bernoulli)
  – Beta: continuous variable in [0,1]; conjugate prior of the Bernoulli/Binomial
• Discrete, multi-valued
  – Multinomial: one of K values = K-dimensional binary vector (K=2 recovers the binary case)
  – Dirichlet: K random variables in [0,1]; conjugate prior of the Multinomial
• Continuous
  – Gaussian (limit of the Binomial for large N)
  – Gamma: conjugate prior of univariate Gaussian precision
  – Wishart: conjugate prior of multivariate Gaussian precision matrix
  – Student's-t: generalization of Gaussian robust to outliers; infinite mixture of Gaussians
  – Exponential: special case of Gamma
  – Uniform
  – Gaussian-Gamma: conjugate prior of univariate Gaussian with unknown mean and precision
  – Gaussian-Wishart: conjugate prior of multivariate Gaussian with unknown mean and precision matrix
• Angular
  – Von Mises