A gentle introduction to Bayesian nonparametrics
TRANSCRIPT
A gentle introduction to BNP, Part I
Antonio Canale
Università di Torino & Collegio Carlo Alberto
StaTalk on BNP, 19/02/16
Introduction The Dirichlet process Nonparametric mixture models
Outline of the talk(s)
1 Why BNP? (A)
2 The Dirichlet process (A)
3 Nonparametric mixture models (A)
4 Beyond the DP (J)
5 Species sampling processes (J)
6 Completely random measures (J)
Why Bayesian nonparametrics (BNP)?
Why nonparametric?
• We don't want to strictly impose any model but let the data speak;
• The idea of a true model governed by relatively few parameters is unrealistic.
Why Bayesian?
• If we have a reasonable guess for the true model, we want to use this prior knowledge.
• Large support and consistency are interesting concepts related to priors on infinite-dimensional spaces (Pierpaolo's talk in the afternoon).
The goal of BNP is to fit a single model that can adapt its complexity to the data.
How Bayesian and nonparametric?
Define F as the space of densities and let P ∈ F. A Bayesian analysis starts with

y ∼ P
P ∼ π

where π is a measure on the space F. Hence BNP is infinitely parametric.
The Dirichlet distribution
• Start with independent Zj ∼ Ga(αj, 1), for j = 1, . . . , k (αj > 0);
• Define

πj = Zj / ∑_{l=1}^k Zl;

• Then (π1, . . . , πk) ∼ Dir(α1, . . . , αk);
• The Dirichlet distribution is a distribution over the k-dimensional probability simplex

∆k = {(π1, . . . , πk) : πj > 0, ∑_j πj = 1}.
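The normalized-Gammas construction above translates directly into a few lines of NumPy (the function name and the parameter values are illustrative choices):

```python
import numpy as np

def dirichlet_via_gammas(alpha, rng):
    """Sample (pi_1, ..., pi_k) ~ Dir(alpha_1, ..., alpha_k) by
    normalizing independent Gamma(alpha_j, 1) draws."""
    z = rng.gamma(shape=np.asarray(alpha, dtype=float), scale=1.0)
    return z / z.sum()

pi = dirichlet_via_gammas([2.0, 3.0, 5.0], np.random.default_rng(0))
print(pi.sum())  # lies in the simplex: positive entries summing to 1
```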
• Probability density:

p(π1, . . . , πk | α) = [Γ(∑_j αj) / ∏_j Γ(αj)] ∏_j πj^(αj − 1)
The Dirichlet distribution in Bayesian statistics
The Dirichlet distribution is conjugate to the multinomial likelihood. Hence, if

π ∼ Dir(α1, . . . , αk)
y | π ∼ Multinomial(π),  p(y = j | π) = πj,

then we have

π | y = j ∼ Dir(α̃1, . . . , α̃k),

where α̃j = αj + 1 and α̃i = αi for each i ≠ j.
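In practice the conjugate update just adds the observed category counts to the prior parameters; a tiny NumPy illustration (the prior and the counts are arbitrary):

```python
import numpy as np

# Conjugate update: with a Dir(alpha) prior and multinomial data,
# the posterior is Dirichlet with alpha_j incremented by the number
# of observations that fell in category j.
alpha = np.array([1.0, 1.0, 1.0])   # prior parameters
counts = np.array([4, 0, 1])        # observed category counts
alpha_post = alpha + counts         # posterior Dir parameters

post_mean = alpha_post / alpha_post.sum()  # posterior mean of pi
print(alpha_post, post_mean)
```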
Agglomerative property of Dirichlet distributions
• Combining entries by their sum:

(π1, . . . , πk) ∼ Dir(α1, . . . , αk)
⟹ (π1, . . . , πi + πj, . . . , πk) ∼ Dir(α1, . . . , αi + αj, . . . , αk)

• Marginals follow Beta distributions: πj ∼ Beta(αj, ∑_{h≠j} αh).
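The Beta marginal is easy to verify by Monte Carlo; a quick check (sample size and parameters chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0, 5.0])

# Monte Carlo check of the marginal: the first coordinate pi_1 should
# follow Beta(alpha_1, alpha_2 + alpha_3).
pis = rng.dirichlet(alpha, size=200_000)
a, b = alpha[0], alpha[1:].sum()
beta_mean = a / (a + b)                            # = 2/10 = 0.2
beta_var = a * b / ((a + b) ** 2 * (a + b + 1))    # = 16/1100

print(pis[:, 0].mean(), pis[:, 0].var())
```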
1 Introduction
2 The Dirichlet process
3 Nonparametric mixture models
Ferguson (1973) definition of the Dirichlet process
Definition
• P is a random probability measure over (Y,B(Y)).
• F is the whole space of probability measures on (Y,B(Y)), so P ∈ F.
• Let α ∈ R+ and P0 ∈ F .
• P ∼ DP(α,P0) iff for any n and any partition B1, . . . ,Bn of Y
(P(B1),P(B2), . . . ,P(Bn)) ∼ Dir(αP0(B1), αP0(B2), . . . , αP0(Bn))
The DP is a distribution of random probability distributions.
Interpretation
If P ∼ DP(α,P0), then for any measurable A
• E (P(A)) = P0(A)
• Var(P(A)) = P0(A){1− P0(A)}/(1 + α)
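Both moments can be checked by simulation: for a fixed set A, the finite-dimensional definition applied to the partition {A, Aᶜ} implies P(A) ∼ Beta(αP0(A), α(1 − P0(A))). A minimal NumPy sketch (the values of α and P0(A) are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, p0A = 5.0, 0.3   # concentration and base-measure mass P0(A)

# By the DP definition on the partition {A, A^c},
# P(A) ~ Beta(alpha * P0(A), alpha * (1 - P0(A))).
draws = rng.beta(alpha * p0A, alpha * (1.0 - p0A), size=200_000)

print(draws.mean())  # ~ P0(A) = 0.3
print(draws.var())   # ~ P0(A)(1 - P0(A))/(1 + alpha) = 0.035
```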
Density estimation using DP priors
If yi iid∼ P for i = 1, . . . , n, and a priori P ∼ DP(α, P0), then

P | y1, . . . , yn ∼ DP( α + n,  1/(α + n) ∑_{i=1}^n δyi + α/(α + n) P0 ).
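The posterior mean of the CDF is therefore a weighted average of the empirical CDF and the base CDF, with weights n and α. A small sketch, assuming P0 = N(0, 1) and simulated data (function names are my own):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
alpha = 1.0
y = rng.normal(1.0, sqrt(2.0), size=50)   # data from N(1, 2)

def std_normal_cdf(t):
    """CDF of the N(0, 1) base measure P0."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def posterior_mean_cdf(t, y, alpha):
    """Posterior expectation of P((-inf, t]): a weighted average of
    the empirical CDF and the base CDF, with weights n and alpha."""
    n = len(y)
    return (np.mean(y <= t) * n + alpha * std_normal_cdf(t)) / (n + alpha)

print(posterior_mean_cdf(1.0, y, alpha))
```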
Density estimation using DP priors
Figure: Black true density (N(1, 2)), blue base measure (N(0, 1)), green dashed ECDF, blue dashed posterior DP. First plot n = 10, second n = 50.
Stick-breaking
An alternative representation of the DP is given by the so-called stick-breaking process:
Stick-breaking representation of the DP
To obtain P ∼ DP(αP0):
• Draw a sequence of Beta random variables Vj iid∼ Beta(1, α);
• Define a sequence of weights πj = Vj ∏_{l<j} (1 − Vl);
• Draw independent atoms θj iid∼ P0;
• Define

P = ∑_{j=1}^∞ πj δθj.
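These steps translate directly into code once the infinite sum is truncated at a finite number of atoms (a standard practical approximation; the truncation level and the base measure N(0, 1) are my choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def stick_breaking(alpha, n_atoms, rng):
    """Truncated stick-breaking draw from DP(alpha, P0) with P0 = N(0, 1):
    weights pi_j = V_j * prod_{l<j} (1 - V_l), atoms theta_j ~ P0."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    pi = v * remaining              # length of stick broken off at step j
    theta = rng.normal(0.0, 1.0, size=n_atoms)
    return pi, theta

pi, theta = stick_breaking(alpha=2.0, n_atoms=500, rng=rng)
print(pi.sum())  # close to 1: the truncation discards negligible mass
```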
Stochastic processes and Chinese restaurants. . .
Imagine a Chinese restaurant with countably infinitely many tables, labelled 1, 2, . . . Customers walk in and sit down at some table, chosen according to the following random process.
1 The first customer sits at table 1;
2 The n-th customer chooses the first unoccupied table with probability α/(α + n − 1), and occupied table j with probability nj/(α + n − 1), where nj is the number of people already sitting at that table.
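The seating process above can be simulated in a few lines (plain Python; the function name is hypothetical):

```python
import random

def crp(n, alpha, seed=0):
    """Simulate table occupancy counts for n customers in the
    Chinese restaurant process with concentration alpha."""
    rng = random.Random(seed)
    tables = []                          # tables[j] = customers at table j
    for i in range(n):                   # customer i+1 arrives
        r = rng.uniform(0.0, alpha + i)  # total weight = alpha + i
        acc = 0.0
        chosen = len(tables)             # default: open a new table
        for j, nj in enumerate(tables):
            acc += nj                    # occupied table j has weight nj
            if r < acc:
                chosen = j
                break
        if chosen == len(tables):
            tables.append(1)
        else:
            tables[chosen] += 1
    return tables

counts = crp(100, alpha=1.0)
print(len(counts), "tables;", sum(counts), "customers")
```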
CRP or Polya urn construction of the DP
If θi iid∼ P0 and P ∼ DP(αP0), integrating out P gives

pr(θi | θ1, . . . , θi−1) = ∑_j nj/(α + i − 1) δθj + α/(α + i − 1) P0,

where the sum runs over the distinct values θj among θ1, . . . , θi−1 and nj is the number of previous draws equal to θj. It follows that (θ1, . . . , θn) ∼ PU(αP0), the Pólya urn scheme.
Considerations
• Draws from a DP are a.s. discrete;
• Unappealing if y is continuous; useful if y is discrete? (no, but wait for my afternoon talk)
Finite mixture models
Assume the following model:

yi ∼ N(μSi, σ²Si),  pr(Si = h) = πh,

with likelihood

f(y | μ, σ², π) = ∑_{j=1}^k πj φ(y; μj, σ²j),

and priors (μj, σ²j) ∼ P0, π ∼ Dir(α).
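The mixture likelihood above can be evaluated directly; a minimal sketch with made-up parameter values:

```python
import numpy as np

def normal_pdf(y, mu, sigma2):
    """Gaussian density phi(y; mu, sigma2)."""
    return np.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def fmm_density(y, pi, mu, sigma2):
    """Finite-mixture likelihood f(y) = sum_j pi_j * phi(y; mu_j, sigma2_j)."""
    return sum(p * normal_pdf(y, m, s2) for p, m, s2 in zip(pi, mu, sigma2))

# Hypothetical two-component example
pi = [0.4, 0.6]
mu = [-1.0, 2.0]
sigma2 = [1.0, 0.5]
print(fmm_density(0.0, pi, mu, sigma2))
```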
FMM applications: density estimation
• With enough components, a mixture of Gaussians can approximate any continuous distribution.
• If the number of components equals n, we recover kernel density estimation.
[Figure: mixture density estimate of geyser$duration; y-axis: probability density function.]
FMM applications: model-based clustering
• Divide observations into homogeneous clusters;
• "Homogeneous" depends on the kernel (Gaussian in the previous slide);
• With a Gaussian kernel, there are two clusters in the iris dataset (the truth is three!);
• See discussions in Petralia et al. (2012), Canale and Scarpa (2015), and Canale and De Blasi (2015).
[Figure: scatterplot of iris$Sepal.Width against iris$Sepal.Length.]
Infinite mixture models
• A more elegant way to write the finite mixture model is

f(y) = ∫ K(y; θ) dP(θ),  P = ∑_{j=1}^k ωj δθj,

where K(·; θ) is a general kernel (e.g. normal) parametrized by θ.
• Clearly, a prior on the weights and on the parameters of the kernel is equivalent to a prior on the finite discrete measure P.
• From FMM to IMM ⇒ P ∼ DP(αP0)!
DP mixture models
• The model and prior are

y ∼ f,  f(y) = ∫ K(y; θ) dP(θ),  P ∼ DP(αP0),

where K(·; θ) is a general kernel (e.g. normal) parametrized by θ.
• The DPM prior can be seen as a "smoothed version" of the DP prior (just as kernel density estimation is a smoothed version of the histogram).
• Widely used for continuous distributions.
Hierarchical representation
Using a hierarchical representation, the mixture model can be expressed as

yi | θi ∼ K(yi; θi)
θi ∼ P
P ∼ DP(αP0).
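Combining this hierarchical form with the Pólya urn scheme seen earlier gives a simple way to simulate from the DPM prior, with P integrated out (the base measure N(0, 1) and the kernel bandwidth below are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_dpm_prior(n, alpha, rng):
    """One draw of (y_1, ..., y_n) from the hierarchical DPM model,
    with P integrated out via the Polya urn. Hypothetical choices:
    P0 = N(0, 1) and kernel K(y; theta) = N(theta, 0.5^2)."""
    theta = []
    for i in range(n):
        if rng.uniform() < alpha / (alpha + i):
            theta.append(rng.normal(0.0, 1.0))      # fresh atom from P0
        else:
            theta.append(theta[rng.integers(i)])    # copy an earlier theta
    theta = np.array(theta)
    y = rng.normal(theta, 0.5)                      # y_i | theta_i ~ K
    return y, theta

y, theta = sample_dpm_prior(200, alpha=1.0, rng=rng)
print(len(np.unique(theta)), "distinct atoms among", len(theta), "draws")
```

Picking a past θ uniformly at random reproduces the urn weights: a value with multiplicity nj is chosen with probability nj/(α + i), matching the predictive rule.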
Mixture of Gaussians
• Gold standard for density estimation;
• can approximate any continuous distribution (Lo, 1984; Escobar and West, 1995);
• large support and good frequentist properties (Ghosal et al., 1999).
The model and the prior are

f(y) = ∫ N(y; μ, τ⁻¹) dP(μ, τ⁻¹),  P ∼ DP(αP0),

where N(y; μ, τ⁻¹) is a normal kernel with mean μ and precision τ, and P0 is a Normal-Gamma base measure, for conjugacy.
Mixture of Gaussians
yi | μi, τi ∼ N(μi, τi⁻¹)
(µi , τi ) ∼ P
P ∼ DP(αP0).
Complex data
• Mixture models can also be used when we have complex (modern) data.
• An example is functional data f1, . . . , fn:

fi(t) = ηi(t) + εit,

where ηi is a smooth function of t and the εit are random noise terms.
• We can model these data with

fi | ηi ∼ N(ηi, σ²)
ηi ∼ P
P ∼ DP(αP0).
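A quick sketch of how such functional data might be generated for experimentation (the mean curve, evaluation grid, and noise level are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 100)   # common evaluation grid

def simulate_curves(n, t, rng, sigma=0.1):
    """Simulate n noisy functional observations f_i(t) = eta(t) + eps_it
    around a hypothetical smooth mean curve eta(t) = sin(2*pi*t)."""
    eta = np.sin(2.0 * np.pi * t)
    return eta + rng.normal(0.0, sigma, size=(n, t.size))

curves = simulate_curves(5, t, rng)
print(curves.shape)  # (5, 100)
```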