
Statistical Computing, University of Notre Dame, Notre Dame, IN, USA (Fall 2018, N. Zabaras)

Exponential Family of Distributions

Prof. Nicholas Zabaras

Center for Informatics and Computational Science

https://cics.nd.edu/

University of Notre Dame

Notre Dame, Indiana, USA

Email: [email protected]

URL: https://www.zabaras.com/

September 6, 2018



References

• C. P. Robert, The Bayesian Choice: From Decision-Theoretic Motivations to Computational Implementation, Springer-Verlag, NY, 2001 (online resource).
  http://www.ceremade.dauphine.fr/~xian/books.html
  http://link.springer.com/book/10.1007/0-387-71599-1/page/1

• A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC Press, 2nd Edition, 2003.
  http://www.amazon.com/Bayesian-Analysis-Edition-Chapman-Statistical/dp/158488388X

• J. M. Marin and C. P. Robert, The Bayesian Core, Springer-Verlag, 2007 (online resource).
  http://www.ceremade.dauphine.fr/~xian/BCS/
  http://link.springer.com/book/10.1007/978-0-387-38983-7/page/1

• D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, Oxford University Press, 2006.
  http://www.amazon.com/Data-Analysis-Bayesian-Devinderjit-Sivia/dp/0198568320

• B. Vidakovic, Bayesian Statistics for Engineering, Online Course at Georgia Tech.
  http://www2.isye.gatech.edu/~brani/isyebayes/

• C. Bishop, Pattern Recognition and Machine Learning (PRML), Chapter 2.
  https://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738

• M. Jordan, An Introduction to Probabilistic Graphical Models, Chapter 8 (preprint).

• K. Murphy, Machine Learning: A Probabilistic Perspective, Chapters 2 and 4.
  http://www.cs.ubc.ca/~murphyk/MLbook/


Contents

• Exponential Family, Bernoulli, Beta, Gamma, Gaussian

• Conjugate Priors, Posterior Predictive

• Computing Moments

• Moment Parametrization

• MLE for the Exponential Family

• Maximum Entropy and the Exponential Family

• Generalized Linear Models


Exponential Family

A large family of useful distributions with common properties.

In the family: Bernoulli, beta, binomial, chi-square, Dirichlet, gamma, Gaussian, geometric, multinomial, Poisson, Weibull (with fixed shape parameter), ...

Not in the family: Uniform, Student's T, Cauchy, Laplace, Mixture of Gaussians, ...

The variable can be discrete or continuous (or vectors thereof).


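Exponential Family

The exponential family of distributions over $\mathbf{x}$, given parameters $\boldsymbol{\eta}$, is defined to be the set of distributions of the form

$$
p(\mathbf{x}\mid\boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}
\quad\text{or}\quad
p(\mathbf{x}\mid\boldsymbol{\eta}) = h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\},
\quad\text{where } A(\boldsymbol{\eta}) = -\log g(\boldsymbol{\eta}).
$$

$\mathbf{x}$ is scalar/vector, discrete/continuous. $\boldsymbol{\eta}$ are the natural parameters and $\mathbf{u}(\mathbf{x})$ is referred to as a sufficient statistic.

$g(\boldsymbol{\eta})$ ensures that the distribution is normalized and satisfies

$$
g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}\, d\mathbf{x} = 1.
$$

The normalization factor $Z$ and its logarithm $A$ are defined as:

$$
Z(\boldsymbol{\eta}) = \frac{1}{g(\boldsymbol{\eta})}, \qquad
A(\boldsymbol{\eta}) = \ln Z(\boldsymbol{\eta}) = -\ln g(\boldsymbol{\eta}) = \ln \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}\, d\mathbf{x},
$$

$$
p(\mathbf{x}\mid\boldsymbol{\eta}) = \frac{1}{Z(\boldsymbol{\eta})}\, h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}.
$$

The space of $\boldsymbol{\eta}$ for which

$$
\int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}\, d\mathbf{x} < \infty
$$

is the natural parameter space.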


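Canonical or Natural Parameters

When the parameter $\boldsymbol{\theta}$ enters the exponential family as $\boldsymbol{\eta}(\boldsymbol{\theta})$, we write the probability density of the exponential family as follows:

$$
p(\mathbf{x}\mid\boldsymbol{\theta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}(\boldsymbol{\theta})) \exp\{\boldsymbol{\eta}(\boldsymbol{\theta})^T \mathbf{u}(\mathbf{x})\}
\quad\text{or}\quad
p(\mathbf{x}\mid\boldsymbol{\theta}) = h(\mathbf{x}) \exp\{\boldsymbol{\eta}(\boldsymbol{\theta})^T \mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta}(\boldsymbol{\theta}))\},
$$

$$
\text{where } A(\boldsymbol{\eta}(\boldsymbol{\theta})) = -\log g(\boldsymbol{\eta}(\boldsymbol{\theta})).
$$

$\boldsymbol{\eta}(\boldsymbol{\theta})$ are the canonical or natural parameters.

$\boldsymbol{\theta}$ is the parameter vector of some distribution that can be written in the exponential family format.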


Exponential Family: The Bernoulli Distribution

Consider the Bernoulli distribution:

$$
p(x\mid\mu) = \mathrm{Bern}(x\mid\mu) = \mu^{x}(1-\mu)^{1-x}
= \exp\{x\ln\mu + (1-x)\ln(1-\mu)\}
= \underbrace{(1-\mu)}_{g(\eta)} \exp\left\{\ln\left(\frac{\mu}{1-\mu}\right) x\right\}.
$$

From this we see that (note that the relation $\eta(\mu)$ is invertible)

$$
\eta = \ln\frac{\mu}{1-\mu}
$$

and

$$
\mu = \sigma(\eta) = \frac{1}{1+e^{-\eta}} \quad\text{(logistic sigmoid function)}, \qquad
g(\eta) = 1-\mu = 1-\sigma(\eta) = \sigma(-\eta).
$$

Finally:

$$
p(x\mid\eta) = g(\eta)\exp(\eta x), \quad u(x) = x, \quad h(x) = 1, \quad g(\eta) = \sigma(-\eta),
$$

$$
A(\eta) = -\ln g(\eta) = -\log(1-\mu) = \log(1+e^{\eta}).
$$

https://en.wikipedia.org/wiki/Bernoulli_distribution
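The Bernoulli identification above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names (`sigmoid`, `bernoulli_pmf`, `bernoulli_expfam`) and the value `mu = 0.3` are illustrative choices, not part of the lecture.

```python
import math

def sigmoid(eta):
    """Logistic sigmoid: maps the natural parameter eta back to the mean mu."""
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_pmf(x, mu):
    """Standard (moment) parametrization: mu^x (1 - mu)^(1 - x)."""
    return mu**x * (1.0 - mu)**(1 - x)

def bernoulli_expfam(x, eta):
    """Exponential family form with h(x) = 1, u(x) = x, A(eta) = log(1 + e^eta)."""
    log_partition = math.log1p(math.exp(eta))
    return math.exp(eta * x - log_partition)

mu = 0.3                              # illustrative value
eta = math.log(mu / (1.0 - mu))       # natural parameter (log-odds)

for x in (0, 1):
    print(x, bernoulli_pmf(x, mu), bernoulli_expfam(x, eta))
```

Both columns should agree for each value of `x`, and `sigmoid(eta)` should recover `mu`, which illustrates that the map between the two parametrizations is invertible.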


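Exponential Family: The Beta Distribution

Consider the Beta distribution

$$
\mathrm{Beta}(\mu\mid a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \mu^{a-1}(1-\mu)^{b-1}
= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \exp\{(a-1)\ln\mu + (b-1)\ln(1-\mu)\}.
$$

Comparing this with our exponential family:

$$
p(\mathbf{x}\mid\boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}
= h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\},
$$

we can easily identify:

$$
\mathbf{u}(\mu) = (\ln\mu,\ \ln(1-\mu))^T, \quad
\boldsymbol{\eta} = (a-1,\ b-1)^T, \quad
h(\mu) = 1, \quad
g(a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)},
$$

$$
A(a,b) = -\ln g(a,b) = \ln\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.
$$

https://en.wikipedia.org/wiki/Beta_distribution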


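Exponential Family: The Gaussian

Consider the univariate Gaussian

$$
p(x\mid\mu,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}
= \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}\mu^2\right\}.
$$

Comparing this with our exponential family:

$$
p(\mathbf{x}\mid\boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}
= h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\},
$$

we can identify (this is a two-parameter distribution):

$$
\mathbf{u}(x) = (x,\ x^2)^T, \quad
\boldsymbol{\eta} = \left(\frac{\mu}{\sigma^2},\ -\frac{1}{2\sigma^2}\right)^T, \quad
h(x) = (2\pi)^{-1/2}, \quad
g(\boldsymbol{\eta}) = (-2\eta_2)^{1/2} \exp\left(\frac{\eta_1^2}{4\eta_2}\right),
$$

$$
A(\boldsymbol{\eta}) = -\ln g(\boldsymbol{\eta}) = -\frac{1}{2}\ln(-2\eta_2) - \frac{\eta_1^2}{4\eta_2}.
$$

https://en.wikipedia.org/wiki/Normal_distribution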
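As a sanity check of the Gaussian identification, the sketch below evaluates the density both in its usual form and in the exponential family form with $\boldsymbol{\eta} = (\mu/\sigma^2, -1/(2\sigma^2))$. The function names and the test values are illustrative assumptions, not part of the lecture.

```python
import math

def gaussian_natural(mu, sigma2):
    """Map (mu, sigma^2) to natural parameters (eta1, eta2)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def log_partition(eta1, eta2):
    """A(eta) = -1/2 ln(-2 eta2) - eta1^2 / (4 eta2); requires eta2 < 0."""
    return -0.5 * math.log(-2.0 * eta2) - eta1**2 / (4.0 * eta2)

def gaussian_expfam_pdf(x, eta1, eta2):
    """h(x) exp(eta^T u(x) - A(eta)) with u(x) = (x, x^2), h(x) = (2 pi)^(-1/2)."""
    h = 1.0 / math.sqrt(2.0 * math.pi)
    return h * math.exp(eta1 * x + eta2 * x * x - log_partition(eta1, eta2))

def gaussian_pdf(x, mu, sigma2):
    """Usual form of the univariate Gaussian density."""
    return math.exp(-(x - mu)**2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

mu, sigma2 = 1.5, 0.8                      # illustrative values
eta1, eta2 = gaussian_natural(mu, sigma2)

for x in (-1.0, 0.0, 2.3):
    print(x, gaussian_pdf(x, mu, sigma2), gaussian_expfam_pdf(x, eta1, eta2))
```

The two columns should match to floating-point precision. Note that the natural parameter space here is $\eta_2 < 0$, which is what keeps the logarithm in `log_partition` well defined.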


Conjugate Priors

In general, for a given probability distribution $p(\mathbf{x}\mid\boldsymbol{\eta})$, we can seek a prior $p(\boldsymbol{\eta})$ that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior.

For the Bernoulli, the conjugate prior is the Beta distribution.

For the Gaussian, the conjugate prior for the mean is a Gaussian, and the conjugate prior for the precision is the Wishart distribution (which reduces to the Gamma distribution in the univariate case).

https://en.wikipedia.org/wiki/Wishart_distribution


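Conjugate Priors

For any member of the exponential family with likelihood

$$
p(\mathbf{x}\mid\boldsymbol{\theta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}(\boldsymbol{\theta})) \exp\{\boldsymbol{\eta}^T(\boldsymbol{\theta})\, \mathbf{u}(\mathbf{x})\}
$$

there exists a conjugate prior that can be written in the form

$$
p(\boldsymbol{\theta}\mid\nu_0,\boldsymbol{\tau}_0) \propto g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{\nu_0} \exp\{\boldsymbol{\eta}^T(\boldsymbol{\theta})\, \boldsymbol{\tau}_0\}
= \exp\{\nu_0\, \boldsymbol{\eta}^T(\boldsymbol{\theta})\, \bar{\boldsymbol{\tau}}_0 - \nu_0\, A(\boldsymbol{\eta}(\boldsymbol{\theta}))\},
\quad\text{where } \boldsymbol{\tau}_0 \equiv \nu_0 \bar{\boldsymbol{\tau}}_0.
$$

In normalized form, we write:

$$
p(\boldsymbol{\theta}\mid\nu_0,\boldsymbol{\tau}_0)
= \frac{1}{Z(\nu_0,\boldsymbol{\tau}_0)}\, g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{\nu_0} \exp\{\boldsymbol{\eta}^T(\boldsymbol{\theta})\, \boldsymbol{\tau}_0\}
= \frac{1}{Z(\nu_0,\boldsymbol{\tau}_0)} \exp\{\nu_0\, \boldsymbol{\eta}^T(\boldsymbol{\theta})\, \bar{\boldsymbol{\tau}}_0 - \nu_0\, A(\boldsymbol{\eta}(\boldsymbol{\theta}))\},
$$

$$
\text{where } Z(\nu_0,\boldsymbol{\tau}_0) = \int \exp\{\nu_0\, \boldsymbol{\eta}^T(\boldsymbol{\theta})\, \bar{\boldsymbol{\tau}}_0 - \nu_0\, A(\boldsymbol{\eta}(\boldsymbol{\theta}))\}\, d\boldsymbol{\theta}.
$$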


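Conjugate Priors

The likelihood of a data set $\mathbf{X} = \{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$ and the conjugate prior are:

$$
p(\mathbf{X}\mid\boldsymbol{\theta}) = \prod_{n=1}^{N} h(\mathbf{x}_n)\, g(\boldsymbol{\eta}(\boldsymbol{\theta})) \exp\{\boldsymbol{\eta}^T(\boldsymbol{\theta})\, \mathbf{u}(\mathbf{x}_n)\}
= \left[\prod_{n=1}^{N} h(\mathbf{x}_n)\right] g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{N} \exp\left\{\boldsymbol{\eta}^T(\boldsymbol{\theta}) \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)\right\},
$$

$$
p(\boldsymbol{\theta}\mid\nu_0,\boldsymbol{\tau}_0)
= \frac{1}{Z(\nu_0,\boldsymbol{\tau}_0)}\, g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{\nu_0} \exp\{\boldsymbol{\eta}^T(\boldsymbol{\theta})\, \boldsymbol{\tau}_0\}
= \frac{1}{Z(\nu_0,\boldsymbol{\tau}_0)} \exp\{\nu_0\, \boldsymbol{\eta}^T(\boldsymbol{\theta})\, \bar{\boldsymbol{\tau}}_0 - \nu_0\, A(\boldsymbol{\eta}(\boldsymbol{\theta}))\}.
$$

Using $\bar{\mathbf{u}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)$, the posterior becomes (this form justifies the parametrization in terms of $\bar{\boldsymbol{\tau}}_0$):

$$
p(\boldsymbol{\theta}\mid\mathbf{X},\nu_0,\boldsymbol{\tau}_0) \propto g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{\nu_0+N} \exp\left\{\boldsymbol{\eta}^T(\boldsymbol{\theta}) \left(\sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n) + \nu_0\bar{\boldsymbol{\tau}}_0\right)\right\}
= g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{\nu_0+N} \exp\{\boldsymbol{\eta}^T(\boldsymbol{\theta}) (N\bar{\mathbf{u}} + \nu_0\bar{\boldsymbol{\tau}}_0)\}.
$$

The parameter $\nu_0$ can be interpreted as the effective number of fictitious observations in the prior, each of which has a value for the sufficient statistic equal to $\bar{\boldsymbol{\tau}}_0$.

$$
p(\boldsymbol{\theta}\mid\mathbf{X},\nu_N,\boldsymbol{\tau}_N)
= \frac{1}{Z(\nu_N,\boldsymbol{\tau}_N)}\, g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{\nu_N} \exp\left\{(N+\nu_0)\, \boldsymbol{\eta}^T(\boldsymbol{\theta})\, \frac{N\bar{\mathbf{u}} + \nu_0\bar{\boldsymbol{\tau}}_0}{N+\nu_0}\right\}
= \frac{1}{Z(\nu_N,\boldsymbol{\tau}_N)}\, g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{\nu_N} \exp\{\nu_N\, \boldsymbol{\eta}^T(\boldsymbol{\theta})\, \bar{\boldsymbol{\tau}}_N\},
$$

$$
\text{where } \nu_N = \nu_0 + N, \quad
\bar{\boldsymbol{\tau}}_N = \frac{N\bar{\mathbf{u}} + \nu_0\bar{\boldsymbol{\tau}}_0}{N+\nu_0}, \quad
\boldsymbol{\tau}_N = \nu_N\bar{\boldsymbol{\tau}}_N = N\bar{\mathbf{u}} + \nu_0\bar{\boldsymbol{\tau}}_0 = \sum_{i=1}^{N} \mathbf{u}(\mathbf{x}_i) + \boldsymbol{\tau}_0.
$$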
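The posterior update $\nu_N = \nu_0 + N$, $\boldsymbol{\tau}_N = \boldsymbol{\tau}_0 + \sum_n \mathbf{u}(\mathbf{x}_n)$ amounts to adding counts and sufficient statistics. Below is a minimal sketch for a generic sufficient statistic, instantiated for the Bernoulli case with $u(x) = x$. The helper `conjugate_update`, the prior values and the data are hypothetical; the mapping to a Beta distribution assumes the prior kernel $\theta^{\tau_0}(1-\theta)^{\nu_0-\tau_0}$, i.e. $\alpha = \tau_0 + 1$, $\beta = \nu_0 - \tau_0 + 1$.

```python
def conjugate_update(nu0, tau0, data, u):
    """Return (nu_N, tau_N) with nu_N = nu0 + N and tau_N = tau0 + sum_n u(x_n)."""
    nu_n = nu0 + len(data)
    tau_n = list(tau0)
    for x in data:
        for k, u_k in enumerate(u(x)):
            tau_n[k] += u_k
    return nu_n, tau_n

def bernoulli_u(x):
    """Sufficient statistic of the Bernoulli: u(x) = x (one-component vector)."""
    return [float(x)]

# Hypothetical prior: 2 fictitious tosses with average statistic tau_bar_0 = 0.5
nu0, tau0 = 2.0, [1.0]
data = [1, 0, 1, 1]                      # illustrative observations

nu_n, tau_n = conjugate_update(nu0, tau0, data, bernoulli_u)
tau_bar_n = tau_n[0] / nu_n              # posterior "average" sufficient statistic

# Equivalent Beta(alpha_N, beta_N) parameters under the assumed mapping
alpha_n = tau_n[0] + 1.0
beta_n = nu_n - tau_n[0] + 1.0
print(nu_n, tau_n, tau_bar_n, alpha_n, beta_n)
```

With these values the prior corresponds to Beta(2, 2), and three heads plus one tail give Beta(5, 3), matching the familiar count-based Beta update.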


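Posterior Predictive

Let $\mathbf{X}' = \{\mathbf{x}'_1,\ldots,\mathbf{x}'_{N'}\}$ denote future data, and define

$$
\mathbf{u}(\mathbf{X}) = \sum_{i=1}^{N} \mathbf{u}(\mathbf{x}_i), \qquad
\mathbf{u}(\mathbf{X}') = \sum_{i=1}^{N'} \mathbf{u}(\mathbf{x}'_i).
$$

The posterior predictive is then:

$$
\begin{aligned}
p(\mathbf{X}'\mid\mathbf{X}) &= \int p(\mathbf{X}'\mid\boldsymbol{\theta})\, p(\boldsymbol{\theta}\mid\mathbf{X})\, d\boldsymbol{\theta} \\
&= \left[\prod_{i=1}^{N'} h(\mathbf{x}'_i)\right] \frac{1}{Z(\nu_0+N,\ \boldsymbol{\tau}_0+\mathbf{u}(\mathbf{X}))}
\int g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{N'} \exp\{\boldsymbol{\eta}^T(\boldsymbol{\theta})\, \mathbf{u}(\mathbf{X}')\}\,
g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{\nu_0+N} \exp\{\boldsymbol{\eta}^T(\boldsymbol{\theta}) (\boldsymbol{\tau}_0+\mathbf{u}(\mathbf{X}))\}\, d\boldsymbol{\theta}.
\end{aligned}
$$

This is simplified as follows:

$$
\begin{aligned}
p(\mathbf{X}'\mid\mathbf{X}) &= \left[\prod_{i=1}^{N'} h(\mathbf{x}'_i)\right] \frac{1}{Z(\nu_0+N,\ \boldsymbol{\tau}_0+\mathbf{u}(\mathbf{X}))}
\int g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{\nu_0+N+N'} \exp\{\boldsymbol{\eta}^T(\boldsymbol{\theta}) (\boldsymbol{\tau}_0+\mathbf{u}(\mathbf{X})+\mathbf{u}(\mathbf{X}'))\}\, d\boldsymbol{\theta} \\
&= \left[\prod_{i=1}^{N'} h(\mathbf{x}'_i)\right] \frac{Z(\nu_0+N+N',\ \boldsymbol{\tau}_0+\mathbf{u}(\mathbf{X})+\mathbf{u}(\mathbf{X}'))}{Z(\nu_0+N,\ \boldsymbol{\tau}_0+\mathbf{u}(\mathbf{X}))}.
\end{aligned}
$$

If $N = 0$, this becomes the marginal likelihood of $\mathbf{X}'$, which reduces to the normalizer of the posterior divided by the normalizer of the prior, multiplied by a constant.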


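Beta/Bernoulli: Posterior Predictive

Consider a Bernoulli likelihood with a Beta prior. The likelihood takes the familiar exponential family form:

$$
p(\mathcal{D}\mid\theta) = \theta^{\sum_i x_i}(1-\theta)^{N-\sum_i x_i}
= (1-\theta)^{N} \exp\left\{\log\left(\frac{\theta}{1-\theta}\right) \sum_i x_i\right\},
\qquad s = \sum_{i=1}^{N} \mathbb{I}(x_i = 1).
$$

The conjugate prior is a Beta:

$$
p(\theta\mid\nu_0,\tau_0) \propto (1-\theta)^{\nu_0} \exp\left\{\tau_0 \log\frac{\theta}{1-\theta}\right\}
= \theta^{\tau_0}(1-\theta)^{\nu_0-\tau_0},
$$

$$
p(\theta\mid\nu_0,\tau_0) = \mathrm{Beta}(\theta\mid\alpha,\beta), \qquad \alpha = \tau_0+1, \quad \beta = \nu_0-\tau_0+1.
$$

Thus the posterior becomes:

$$
p(\theta\mid\mathcal{D}) \propto \theta^{\tau_0+s}(1-\theta)^{\nu_0-\tau_0+N-s},
$$

$$
p(\theta\mid\mathcal{D}) = \mathrm{Beta}(\theta\mid\alpha_N,\beta_N), \qquad
\alpha_N = \alpha+s, \quad \beta_N = \beta+N-s, \quad s = \sum_{i} \mathbb{I}(x_i = 1).
$$

Let $s$ be the number of heads in the past data. The probability of a given sequence $\mathcal{D}'$ of $m$ future trials containing $s'$ heads is then:

$$
p(\mathcal{D}'\mid\mathcal{D}) = \int p(\mathcal{D}'\mid\theta)\, \mathrm{Beta}(\theta\mid\alpha_N,\beta_N)\, d\theta
= \frac{\Gamma(\alpha_N+\beta_N)}{\Gamma(\alpha_N)\Gamma(\beta_N)} \int \theta^{\alpha_N+s'-1}(1-\theta)^{\beta_N+m-s'-1}\, d\theta
= \frac{\Gamma(\alpha_N+\beta_N)}{\Gamma(\alpha_N)\Gamma(\beta_N)}\, \frac{\Gamma(\alpha_{N+m})\Gamma(\beta_{N+m})}{\Gamma(\alpha_{N+m}+\beta_{N+m})},
$$

$$
s' = \sum_{i=1}^{m} \mathbb{I}(x'_i = 1), \qquad
\alpha_{N+m} = \alpha_N + s', \qquad \beta_{N+m} = \beta_N + m - s'.
$$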
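The Beta/Bernoulli posterior predictive is a ratio of Beta-function normalizers, which is easiest to evaluate in log space. The sketch below assumes a Beta($\alpha$, $\beta$) prior passed in directly; the function names, default prior and data are illustrative, not from the lecture.

```python
import math

def log_beta_fn(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def predictive_prob(future, past, alpha=1.0, beta=1.0):
    """Probability of the specific 0/1 sequence `future` given the sequence `past`."""
    s, n = sum(past), len(past)
    s_new, m = sum(future), len(future)
    a_n, b_n = alpha + s, beta + n - s                 # posterior Beta parameters
    a_nm, b_nm = a_n + s_new, b_n + m - s_new          # after absorbing the future data
    return math.exp(log_beta_fn(a_nm, b_nm) - log_beta_fn(a_n, b_n))

past = [1, 1, 0]                                       # illustrative data, uniform prior
p_head = predictive_prob([1], past)
p_tail = predictive_prob([0], past)
print(p_head, p_tail, p_head + p_tail)
```

For a single future toss the result reduces to the posterior mean $\alpha_N/(\alpha_N+\beta_N)$, here $3/5$, and the probabilities of the two possible outcomes sum to one.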


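Computing Moments of Sufficient Statistics u(x)

Differentiate the normalization condition $\int p(\mathbf{x}\mid\boldsymbol{\eta})\, d\mathbf{x} = 1$ with respect to $\boldsymbol{\eta}$ for the exponential family:

$$
\int p(\mathbf{x}\mid\boldsymbol{\eta})\, d\mathbf{x} = g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}\, d\mathbf{x} = 1,
$$

$$
\nabla g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}\, d\mathbf{x}
+ g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}\, \mathbf{u}(\mathbf{x})\, d\mathbf{x} = 0.
$$

Rearranging, and using $\int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}\, d\mathbf{x} = 1/g(\boldsymbol{\eta})$:

$$
-\frac{\nabla g(\boldsymbol{\eta})}{g(\boldsymbol{\eta})}
= g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}\, \mathbf{u}(\mathbf{x})\, d\mathbf{x}
= \int p(\mathbf{x}\mid\boldsymbol{\eta})\, \mathbf{u}(\mathbf{x})\, d\mathbf{x}
= \mathbb{E}[\mathbf{u}(\mathbf{x})].
$$

The above equation can be further simplified if written in terms of the partition function $Z = 1/g(\boldsymbol{\eta})$ or $A = \log Z = -\log g(\boldsymbol{\eta})$:

$$
\nabla A(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})].
$$

Let us rewrite the above equation explicitly as:

$$
\nabla A(\boldsymbol{\eta}) = g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}\, \mathbf{u}(\mathbf{x})\, d\mathbf{x}.
$$

We can compute the variance of $\mathbf{u}(\mathbf{x})$ by differentiating the equation above with respect to $\boldsymbol{\eta}$.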



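Differentiating

$$
\nabla A(\boldsymbol{\eta}) = g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}\, \mathbf{u}(\mathbf{x})\, d\mathbf{x}
$$

once more, and using $\nabla g(\boldsymbol{\eta}) = -g(\boldsymbol{\eta})\, \mathbb{E}[\mathbf{u}(\mathbf{x})]$, gives

$$
\begin{aligned}
\nabla^2 A(\boldsymbol{\eta}) &= \nabla g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}\, \mathbf{u}(\mathbf{x})^T\, d\mathbf{x}
+ g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}\, \mathbf{u}(\mathbf{x})\, \mathbf{u}(\mathbf{x})^T\, d\mathbf{x} \\
&= \mathbb{E}[\mathbf{u}(\mathbf{x})\, \mathbf{u}(\mathbf{x})^T] - \mathbb{E}[\mathbf{u}(\mathbf{x})]\, \mathbb{E}[\mathbf{u}(\mathbf{x})^T],
\end{aligned}
$$

$$
\nabla^2 A(\boldsymbol{\eta}) = \mathrm{cov}[\mathbf{u}(\mathbf{x})], \quad\text{so } A(\boldsymbol{\eta}) \text{ is convex.}
$$

Thus the covariance of $\mathbf{u}(\mathbf{x})$ can be expressed in terms of the second derivatives of $A(\boldsymbol{\eta})$, and similarly for higher order moments.

Provided we can normalize a distribution from the exponential family, we can always find its moments by simple differentiation.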



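Let us check the relations

$$
\nabla A(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})], \qquad
\nabla^2 A(\boldsymbol{\eta}) = \mathrm{cov}[\mathbf{u}(\mathbf{x})] \quad (\text{so } A(\boldsymbol{\eta}) \text{ is convex})
$$

for the univariate Gaussian, for which

$$
g(\boldsymbol{\eta}) = (-2\eta_2)^{1/2} \exp\left(\frac{\eta_1^2}{4\eta_2}\right), \qquad
A(\boldsymbol{\eta}) = -\frac{1}{2}\ln(-2\eta_2) - \frac{\eta_1^2}{4\eta_2}, \qquad
\mathbf{u}(x) = (x,\ x^2)^T, \qquad
\boldsymbol{\eta} = \left(\frac{\mu}{\sigma^2},\ -\frac{1}{2\sigma^2}\right)^T.
$$

Differentiating:

$$
\frac{\partial A(\boldsymbol{\eta})}{\partial \eta_1} = -\frac{\eta_1}{2\eta_2} = \mu = \mathbb{E}[X], \qquad
\frac{\partial A(\boldsymbol{\eta})}{\partial \eta_2} = -\frac{1}{2\eta_2} + \frac{\eta_1^2}{4\eta_2^2} = \sigma^2 + \mu^2 = \mathbb{E}[X^2],
$$

$$
\frac{\partial^2 A(\boldsymbol{\eta})}{\partial \eta_1^2} = -\frac{1}{2\eta_2} = \sigma^2 = \mathrm{var}[X], \quad\text{etc.}
$$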
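The Gaussian check above can also be done numerically by differentiating $A(\boldsymbol{\eta})$ with finite differences. This is a rough sketch: the step sizes and the values of $\mu$ and $\sigma^2$ are arbitrary illustrative choices, and central differences only approximate the derivatives.

```python
import math

def A(eta1, eta2):
    """Gaussian log-partition: A(eta) = -1/2 ln(-2 eta2) - eta1^2 / (4 eta2)."""
    return -0.5 * math.log(-2.0 * eta2) - eta1**2 / (4.0 * eta2)

def grad_A(eta1, eta2, eps=1e-6):
    """Central-difference approximation of (dA/d eta1, dA/d eta2)."""
    d1 = (A(eta1 + eps, eta2) - A(eta1 - eps, eta2)) / (2.0 * eps)
    d2 = (A(eta1, eta2 + eps) - A(eta1, eta2 - eps)) / (2.0 * eps)
    return d1, d2

def d2A_deta1(eta1, eta2, eps=1e-4):
    """Second central difference of A with respect to eta1."""
    return (A(eta1 + eps, eta2) - 2.0 * A(eta1, eta2) + A(eta1 - eps, eta2)) / eps**2

mu, sigma2 = 1.5, 0.8                                   # illustrative values
eta1, eta2 = mu / sigma2, -1.0 / (2.0 * sigma2)

mean, second_moment = grad_A(eta1, eta2)                # approximate E[X], E[X^2]
variance = d2A_deta1(eta1, eta2)                        # approximate var[X]
print(mean, second_moment, variance)
```

The printed values should be close to $\mu = 1.5$, $\sigma^2 + \mu^2 = 3.05$ and $\sigma^2 = 0.8$ respectively.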


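Moment Parametrization

We have shown that we can compute the mean of the distribution $\boldsymbol{\mu} = \mathbb{E}[\mathbf{u}(\mathbf{x})]$ in terms of the canonical parameter $\boldsymbol{\eta}$:

$$
\boldsymbol{\mu} = \mathbb{E}[\mathbf{u}(\mathbf{x})] = \nabla A(\boldsymbol{\eta}).
$$

We have also shown that $A(\boldsymbol{\eta})$ is a convex function. Since for a strictly convex function there is a one-to-one relation between the argument of the function and its derivative, the mapping $\boldsymbol{\mu}(\boldsymbol{\eta})$ is invertible (strict convexity holds when the covariance of $\mathbf{u}(\mathbf{x})$ is positive definite).

Thus the exponential family of distributions can also be parameterized in terms of $\boldsymbol{\mu}$ (moment parametrization), exactly as we started this course.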


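MLE for the Exponential Family

The joint density for a data set $\mathbf{X} = \{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$ is itself an exponential family distribution with sufficient statistics $\sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)$:

$$
p(\mathbf{X}\mid\boldsymbol{\eta}) = \prod_{n=1}^{N} h(\mathbf{x}_n)\, g(\boldsymbol{\eta}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}_n)\}
= \left[\prod_{n=1}^{N} h(\mathbf{x}_n)\right] g(\boldsymbol{\eta})^{N} \exp\left\{\boldsymbol{\eta}^T \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)\right\},
$$

$$
\ln p(\mathbf{X}\mid\boldsymbol{\eta}) = \sum_{n=1}^{N} \ln h(\mathbf{x}_n) + N \ln g(\boldsymbol{\eta}) + \boldsymbol{\eta}^T \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)
= \sum_{n=1}^{N} \ln h(\mathbf{x}_n) - N A(\boldsymbol{\eta}) + \boldsymbol{\eta}^T \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n).
$$

Under certain regularity conditions, the exponential family is the only family of distributions with finite sufficient statistics (size independent of the data set size).

The log likelihood is concave ($A$ is convex) and has a unique maximum. Maximizing with respect to $\boldsymbol{\eta}$ gives:

$$
\nabla A(\boldsymbol{\eta}_{ML}) = \frac{1}{N}\sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)
\quad\Longrightarrow\quad
\mathbb{E}[\mathbf{u}(\mathbf{x})] = \frac{1}{N}\sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n).
$$

At the MLE, the empirical average of the sufficient statistics equals the model's theoretical expected sufficient statistics (moment matching).

Thus, to find the expected value of the sufficient statistics, one can use the data directly without having to estimate $\boldsymbol{\eta}$. When $\mathbf{u}(\mathbf{x}) = \mathbf{x}$, the above allows us to compute the expectation of $\mathbf{x}$ directly from the data.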

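MLE for the Exponential Family

Using the sufficient statistic, one can in principle invert the equation

$$
\nabla A(\boldsymbol{\eta}_{ML}) = \mathbb{E}[\mathbf{u}(\mathbf{x})] = \frac{1}{N}\sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)
$$

to compute $\boldsymbol{\eta}_{MLE}$. For example, for the Bernoulli distribution,

$$
p(x\mid\eta) = g(\eta)\exp(\eta x), \quad u(x) = x, \quad h(x) = 1, \quad
g(\eta) = \frac{1}{1+e^{\eta}}, \quad
\mu(\eta) = \frac{1}{1+e^{-\eta}}, \quad
\eta = \ln\frac{\mu}{1-\mu},
$$

and thus:

$$
\mathbb{E}[X] = p(X = 1) = \bar{\mu} \equiv \mu_{MLE} = \frac{1}{N}\sum_{n=1}^{N} \mathbb{I}(x_n = 1)
$$

and

$$
\eta_{MLE} = \ln\frac{\bar{\mu}}{1-\bar{\mu}}.
$$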
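For the Bernoulli case, moment matching can be illustrated in a few lines: the sample mean gives $\mu_{MLE}$, the log-odds gives $\eta_{MLE}$, and the derivative of $A(\eta) = \log(1+e^{\eta})$ at $\eta_{MLE}$ recovers the sample mean. The function name and the data below are illustrative assumptions.

```python
import math

def bernoulli_mle(xs):
    """Return (mu_MLE, eta_MLE) for 0/1 data; eta is undefined if all xs are equal."""
    mu_bar = sum(xs) / len(xs)                  # empirical mean of u(x) = x
    eta = math.log(mu_bar / (1.0 - mu_bar))     # invert mu = sigmoid(eta)
    return mu_bar, eta

xs = [1, 0, 1, 1, 0, 1, 1, 0]                   # illustrative data: 5 heads out of 8
mu_mle, eta_mle = bernoulli_mle(xs)

# Moment matching: dA/d eta = sigmoid(eta) should equal the sample mean at the MLE
dA_deta = 1.0 / (1.0 + math.exp(-eta_mle))
print(mu_mle, eta_mle, dA_deta)
```

If every observation is identical, the sample mean is 0 or 1 and the log-odds diverges, so the MLE of $\eta$ lies on the boundary of the natural parameter space; the sketch does not guard against that case.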


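MLE and Kullback-Leibler Distance

A useful property of the MLE (and not just a property of the exponential family of distributions) is the following:

Minimizing the KL distance to the empirical distribution is equivalent to maximizing the likelihood.

Indeed, let us consider the model $p(x\mid\theta)$, with log likelihood $\ell(\theta\mid\mathcal{D}) = \sum_{n=1}^{N} \log p(x_n\mid\theta)$, and the empirical distribution:

$$
p_{emp}(x) = \frac{1}{N}\sum_{n=1}^{N} \delta(x, x_n).
$$

We can then derive the following:

$$
\sum_{x} p_{emp}(x) \log p(x\mid\theta)
= \sum_{x} \frac{1}{N}\sum_{n=1}^{N} \delta(x, x_n) \log p(x\mid\theta)
= \frac{1}{N}\sum_{n=1}^{N} \log p(x_n\mid\theta)
= \frac{1}{N}\, \ell(\theta\mid\mathcal{D}),
$$

and from this:

$$
\begin{aligned}
KL\big(p_{emp}(x)\,\|\,p(x\mid\theta)\big) &= \sum_{x} p_{emp}(x) \log\frac{p_{emp}(x)}{p(x\mid\theta)}
= \sum_{x} p_{emp}(x) \log p_{emp}(x) - \sum_{x} p_{emp}(x) \log p(x\mid\theta) \\
&= \sum_{x} p_{emp}(x) \log p_{emp}(x) - \frac{1}{N}\, \ell(\theta\mid\mathcal{D}).
\end{aligned}
$$

Since the first term is independent of $\theta$, the assertion is proved.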
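The equivalence between minimizing the KL distance and maximizing the likelihood can be checked on a small discrete example. The sketch below uses a Bernoulli model and a coarse grid over $\theta$; the data, the grid and the function names are illustrative assumptions.

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 1]                 # illustrative 0/1 observations
N = len(data)

# Empirical distribution: relative frequency of each observed value
p_emp = {x: data.count(x) / N for x in set(data)}

def model(x, theta):
    """Bernoulli model p(x | theta)."""
    return theta if x == 1 else 1.0 - theta

def avg_loglik(theta):
    """(1/N) * log likelihood of the data."""
    return sum(math.log(model(x, theta)) for x in data) / N

def kl(theta):
    """KL(p_emp || p(. | theta)) summed over the observed support."""
    return sum(p * math.log(p / model(x, theta)) for x, p in p_emp.items())

grid = [i / 100 for i in range(1, 100)]         # theta in (0, 1)
theta_kl = min(grid, key=kl)                    # minimizer of the KL distance
theta_ml = max(grid, key=avg_loglik)            # maximizer of the likelihood
print(theta_kl, theta_ml)

# kl(theta) + avg_loglik(theta) equals sum_x p_emp log p_emp, independent of theta
print(kl(0.3) + avg_loglik(0.3), kl(0.9) + avg_loglik(0.9))
```

Both searches should select the same grid point (the sample frequency of ones), and the sum of the two quantities should be the same for any $\theta$, which is the constant first term in the derivation above.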


Maximum Entropy and Exponential Family

If nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the default (see footnote a).

The entropy is defined as (discrete case)

$$
\mathcal{H}(p) = -\sum_{k} p(\theta_k) \log p(\theta_k).
$$

When some statistics (moments) of the distribution are known,

$$
\mathbb{E}[g_k(\theta)] = \omega_k, \qquad k = 1,\ldots,K,
$$

the maximum entropy distribution is of the form ($\lambda$'s are the Lagrange multipliers enforcing the constraints):

$$
p^{*}(\theta_i) = \frac{\exp\left\{\sum_{k=1}^{K} \lambda_k g_k(\theta_i)\right\}}{\sum_{j} \exp\left\{\sum_{k=1}^{K} \lambda_k g_k(\theta_j)\right\}}.
$$

Thus the MaxEnt distribution has the form of the exponential family.

a. C. P. Robert, The Bayesian Choice, Springer, 2nd edition, Chapter 3 (full text available).
http://www.ceremade.dauphine.fr/~xian/coursBC.pdf
http://cornell.worldcat.org/oclc/187082900?referer=di&ht=edition
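To make the role of the Lagrange multipliers concrete, here is a small sketch for a single constraint ($K = 1$, $g(\theta) = \theta$) on a six-valued support, in the spirit of the classic loaded-die example. The target mean 4.5, the bisection bounds and the function names are assumptions made for illustration; bisection works here because the constrained mean increases monotonically with $\lambda$.

```python
import math

def maxent_dist(lam, support):
    """MaxEnt distribution p(theta_i) proportional to exp(lam * theta_i)."""
    weights = [math.exp(lam * s) for s in support]
    z = sum(weights)
    return [w / z for w in weights]

def mean(p, support):
    """Expected value of g(theta) = theta under p."""
    return sum(pi * s for pi, s in zip(p, support))

def solve_lambda(target, support, lo=-10.0, hi=10.0, iters=100):
    """Bisection on lam so that the constrained mean matches `target`."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(maxent_dist(mid, support), support) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

support = [1, 2, 3, 4, 5, 6]
lam = solve_lambda(4.5, support)        # multiplier enforcing E[theta] = 4.5
p_star = maxent_dist(lam, support)
print(lam, p_star, mean(p_star, support))
```

With $\lambda = 0$ the result is the uniform distribution (mean 3.5); a target mean above 3.5 requires $\lambda > 0$, which tilts the probabilities exponentially toward the larger values.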


Generalized Linear Models

We now study regression given data $X$ and $Y$ and a generalized linear model (GLIM).

We choose a particular form for the conditional expectation of $Y$. We denote the modeled value of the conditional expectation as $\mu = f(\boldsymbol{\theta}^T \mathbf{x})$, with $\xi = \boldsymbol{\theta}^T \mathbf{x}$.

GLIMs extend the ideas of linear regression beyond the Gaussian, Bernoulli and multinomial settings to the more general exponential family.

$X$ enters linearly as $\boldsymbol{\theta}^T \mathbf{x}$, and $f$ is called the response function. $\Psi$ is a one-to-one map from $\mu$ to $\eta$, i.e. $\eta = \Psi(\mu)$.

To specify a GLIM we need (a) a choice of exponential family distribution, and (b) a choice of the response function $f(\cdot)$.

Choosing the exponential family distribution is strongly constrained by the nature of the data.

Note that $f(\cdot)$ needs to be both monotonic and differentiable. However, there is a particular response function (the canonical response function) that is uniquely associated with a given exponential family distribution.


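Canonical Response Function

The model maps the input to the natural parameter through the chain

$$
\xi = \boldsymbol{\theta}^T \mathbf{x} \ \xrightarrow{\ f\ }\ \mu \ \xrightarrow{\ \Psi\ }\ \eta.
$$

Canonical response function:

$$
f(\cdot) = \Psi^{-1}(\cdot), \quad\text{so that } \xi = \eta.
$$

If we decide to use the canonical response function, the choice of the exponential family density completely determines the GLIM.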


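MLE & Canonical Response Function

Consider a regression problem with data $\mathcal{D} = \{(\mathbf{x}_i, y_i),\ i = 1,\ldots,N\}$. The log likelihood for a GLIM is:

$$
\ell(\boldsymbol{\theta}, \mathcal{D}) = \sum_{n=1}^{N} \log h(y_n) + \sum_{n=1}^{N} \big(\eta_n y_n - A(\eta_n)\big),
\quad\text{where } \eta_n = \Psi(\mu_n), \ \mu_n = f(\xi_n), \text{ with } \xi_n = \boldsymbol{\theta}^T \mathbf{x}_n.
$$

For a canonical response, $\eta_n = \xi_n = \boldsymbol{\theta}^T \mathbf{x}_n$, and this is simplified as:

$$
\ell(\boldsymbol{\theta}, \mathcal{D}) = \sum_{n=1}^{N} \log h(y_n) + \boldsymbol{\theta}^T \underbrace{\sum_{n=1}^{N} \mathbf{x}_n y_n}_{\text{sufficient statistic for } \boldsymbol{\theta}} - \sum_{n=1}^{N} A(\eta_n).
$$

Regardless of $N$, the size of the sufficient statistic is fixed (the dimension of $\mathbf{x}_n$). This is an important reason for using the canonical response.

Differentiating with respect to $\boldsymbol{\theta}$, and using $A'(\eta_n) = \mu_n$ and $\partial \eta_n / \partial \boldsymbol{\theta} = \mathbf{x}_n$, gives the gradient:

$$
\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}, \mathcal{D}) = \sum_{n=1}^{N} \big(y_n - A'(\eta_n)\big)\, \mathbf{x}_n = \sum_{n=1}^{N} (y_n - \mu_n)\, \mathbf{x}_n,
\quad\text{or}\quad
\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}, \mathcal{D}) = \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu}).
$$

This is a general expression for GLIMs with exponential family distributions and the canonical response function.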


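Iterative Reweighted Least Squares (IRLS)

The Hessian can now be computed from

$$
\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}, \mathcal{D}) = \sum_{n=1}^{N} (y_n - \mu_n)\, \mathbf{x}_n,
\quad\text{or}\quad
\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}, \mathcal{D}) = \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu})
$$

as:

$$
\mathbf{H} = \nabla^2_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}, \mathcal{D}) = -\sum_{n=1}^{N} \frac{d\mu_n}{d\eta_n}\, \mathbf{x}_n \mathbf{x}_n^T,
\quad\text{or}\quad
\mathbf{H} = -\mathbf{X}^T \mathbf{W} \mathbf{X},
\quad\text{where } \mathbf{W} = \mathrm{diag}\left(\frac{d\mu_1}{d\eta_1},\ldots,\frac{d\mu_N}{d\eta_N}\right).
$$

To estimate the parameters under the canonical response function, one can use the iteratively reweighted least squares (IRLS) algorithm.

The batch Newton algorithm now takes the familiar IRLS form:

$$
\begin{aligned}
\boldsymbol{\theta}^{t+1} &= \boldsymbol{\theta}^{t} + (\mathbf{X}^T \mathbf{W}^t \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu}^t) \\
&= (\mathbf{X}^T \mathbf{W}^t \mathbf{X})^{-1} \left[\mathbf{X}^T \mathbf{W}^t \mathbf{X}\, \boldsymbol{\theta}^{t} + \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu}^t)\right] \\
&= (\mathbf{X}^T \mathbf{W}^t \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^t \left[\mathbf{X} \boldsymbol{\theta}^{t} + (\mathbf{W}^t)^{-1} (\mathbf{y} - \boldsymbol{\mu}^t)\right] \\
&= (\mathbf{X}^T \mathbf{W}^t \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^t \left[\boldsymbol{\eta}^{t} + (\mathbf{W}^t)^{-1} (\mathbf{y} - \boldsymbol{\mu}^t)\right].
\end{aligned}
$$

For non-canonical response functions, the Hessian has an extra term that contains the factor $(\mathbf{y} - \boldsymbol{\mu})$. When we take expectations this term vanishes, so using the expected Hessian in the Newton method the algorithm looks essentially the same (Fisher scoring algorithm).

https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares

https://en.wikipedia.org/wiki/Newton's_method_in_optimization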
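As an illustration of the Newton/IRLS update with a canonical response, the sketch below fits a logistic regression (Bernoulli likelihood, logistic response, $d\mu/d\eta = \mu(1-\mu)$) with an intercept and one feature. To stay dependency-free it inverts the $2 \times 2$ matrix $\mathbf{X}^T \mathbf{W} \mathbf{X}$ by hand. The data, iteration count and function names are hypothetical, and no safeguards (step damping, separability checks) are included.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def irls_logistic(xs, ys, iters=25):
    """Fit theta = (theta0, theta1) for features (1, x) by Newton/IRLS steps."""
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0            # gradient X^T (y - mu)
        a = b = d = 0.0          # entries of X^T W X = [[a, b], [b, d]]
        for x, y in zip(xs, ys):
            mu = sigmoid(t0 + t1 * x)
            w = mu * (1.0 - mu)  # W_nn = d mu_n / d eta_n
            g0 += y - mu
            g1 += (y - mu) * x
            a += w
            b += w * x
            d += w * x * x
        det = a * d - b * b
        # theta <- theta + (X^T W X)^{-1} X^T (y - mu), explicit 2x2 inverse
        t0 += (d * g0 - b * g1) / det
        t1 += (-b * g0 + a * g1) / det
    return t0, t1

xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]    # illustrative, non-separable data
ys = [0, 0, 1, 0, 1, 1]
theta0, theta1 = irls_logistic(xs, ys)

# At convergence the gradient X^T (y - mu) should vanish
grad0 = sum(y - sigmoid(theta0 + theta1 * x) for x, y in zip(xs, ys))
grad1 = sum((y - sigmoid(theta0 + theta1 * x)) * x for x, y in zip(xs, ys))
print(theta0, theta1, grad0, grad1)
```

The data are deliberately not linearly separable; for separable data the MLE does not exist and the weights would grow without bound, so a practical implementation would need regularization or a stopping rule.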


Sequential Estimation - LMS

An on-line estimation algorithm can be obtained by following the stochastic gradient of the log likelihood function:

$$
\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} + \rho\, (y_n - \mu_n^{t})\, \mathbf{x}_n,
\quad\text{where } \mu_n^{t} = f\big((\boldsymbol{\theta}^{t})^T \mathbf{x}_n\big) \text{ and } \rho \text{ is the step size.}
$$

If we do not use the canonical response function, then the gradient also includes the derivatives of $f(\cdot)$ and $\Psi(\cdot)$. These can be viewed as scaling coefficients that alter the step size, but otherwise leave the general LMS form intact.

The LMS algorithm is the generic stochastic gradient algorithm for models throughout the GLIM family.
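The LMS update can be sketched for the same logistic (canonical response) case, processing one observation at a time. The synthetic data, the step size `rho = 0.05`, the number of passes and the function names are all illustrative assumptions rather than recommended settings.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lms_step(theta, x, y, rho):
    """One stochastic-gradient step: theta <- theta + rho * (y - mu) * x."""
    mu = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [t + rho * (y - mu) * xi for t, xi in zip(theta, x)]

def log_lik(theta, data):
    """Bernoulli log likelihood of the data under the logistic model."""
    total = 0.0
    for x, y in data:
        mu = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += y * math.log(mu) + (1 - y) * math.log(1.0 - mu)
    return total

# Synthetic data from a hypothetical "true" parameter vector
random.seed(0)
true_theta = [-0.5, 2.0]
data = []
for _ in range(200):
    x = [1.0, random.uniform(-2.0, 2.0)]
    p = sigmoid(sum(t * xi for t, xi in zip(true_theta, x)))
    data.append((x, 1 if random.random() < p else 0))

theta = [0.0, 0.0]
for _ in range(20):                       # several passes over the data
    for x, y in data:
        theta = lms_step(theta, x, y, rho=0.05)

print(theta, log_lik([0.0, 0.0], data), log_lik(theta, data))
```

With a constant step size the iterates keep fluctuating around the maximum likelihood solution rather than converging exactly; a decaying step size is a common remedy, which this sketch omits for brevity.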
