A gentle introduction to Bayesian nonparametrics
TRANSCRIPT
A gentle introduction to BNP, Part I
Antonio Canale
Università di Torino & Collegio Carlo Alberto
StaTalk on BNP, 19/02/16
Introduction The Dirichlet process Nonparametric mixture models
Outline of the talk(s)
1 Why BNP? (A)
2 The Dirichlet process (A)
3 Nonparametric mixture models (A)
4 Beyond the DP (J)
5 Species sampling processes (J)
6 Completely random measures (J)
Why Bayesian nonparametrics (BNP)?
Why nonparametric?
• We don't want to strictly impose any model but let the data speak;
• The idea of a true model governed by relatively few parameters is unrealistic.
Why Bayesian?
• If we have a reasonable guess for the true model, we want to use this prior knowledge.
• Large support and consistency are interesting concepts related to priors on infinite-dimensional spaces (Pierpaolo's talk in the afternoon).
The goal of BNP is to fit a single model that can adapt its complexity to the data.
How Bayesian and nonparametric?
Define F as the space of densities and let P ∈ F. A Bayesian analysis starts with

y ∼ P
P ∼ π

where π is a measure on the space F. Hence BNP is infinitely parametric.
The Dirichlet distribution
• Start with independent Zj ∼ Ga(αj, 1), for j = 1, . . . , k (αj > 0);
• Define

πj = Zj / ∑_{l=1}^k Zl;

• Then (π1, . . . , πk) ∼ Dir(α1, . . . , αk);
• The Dirichlet distribution is a distribution over the k-dimensional probability simplex

∆k = {(π1, . . . , πk) : πj > 0, ∑_j πj = 1}.
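The normalized-Gammas construction above translates directly into a few lines of NumPy (the function name and the parameter values are illustrative choices):

```python
import numpy as np

def dirichlet_via_gammas(alpha, rng):
    """Sample (pi_1, ..., pi_k) ~ Dir(alpha_1, ..., alpha_k) by
    normalizing independent Gamma(alpha_j, 1) draws."""
    z = rng.gamma(shape=np.asarray(alpha, dtype=float), scale=1.0)
    return z / z.sum()

pi = dirichlet_via_gammas([2.0, 3.0, 5.0], np.random.default_rng(0))
print(pi.sum())  # lies in the simplex: positive entries summing to 1
```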
• Probability density:

p(π1, . . . , πk | α) = [Γ(∑_j αj) / ∏_j Γ(αj)] ∏_j πj^(αj − 1)
The Dirichlet distribution in Bayesian statistics
The Dirichlet distribution is conjugate to the multinomial likelihood. Hence, if

π ∼ Dir(α1, . . . , αk)
y | π ∼ Multinomial(π),  p(y = j | π) = πj,

then we have

π | y = j ∼ Dir(α̃1, . . . , α̃k),

where α̃j = αj + 1 and α̃i = αi for each i ≠ j.
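In practice the conjugate update just adds the observed category counts to the prior parameters; a tiny NumPy illustration (the prior and the counts are arbitrary):

```python
import numpy as np

# Conjugate update: with a Dir(alpha) prior and multinomial data,
# the posterior is Dirichlet with alpha_j incremented by the number
# of observations that fell in category j.
alpha = np.array([1.0, 1.0, 1.0])   # prior parameters
counts = np.array([4, 0, 1])        # observed category counts
alpha_post = alpha + counts         # posterior Dir parameters

post_mean = alpha_post / alpha_post.sum()  # posterior mean of pi
print(alpha_post, post_mean)
```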
Agglomerative property of Dirichlet distributions
• Combining entries by their sum:

(π1, . . . , πk) ∼ Dir(α1, . . . , αk)
⟹ (π1, . . . , πi + πj, . . . , πk) ∼ Dir(α1, . . . , αi + αj, . . . , αk)

• Marginals follow Beta distributions: πj ∼ Beta(αj, ∑_{h≠j} αh).
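The Beta marginal is easy to verify by Monte Carlo; a quick check (sample size and parameters chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0, 5.0])

# Monte Carlo check of the marginal: the first coordinate pi_1 should
# follow Beta(alpha_1, alpha_2 + alpha_3).
pis = rng.dirichlet(alpha, size=200_000)
a, b = alpha[0], alpha[1:].sum()
beta_mean = a / (a + b)                            # = 2/10 = 0.2
beta_var = a * b / ((a + b) ** 2 * (a + b + 1))    # = 16/1100

print(pis[:, 0].mean(), pis[:, 0].var())
```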
1 Introduction
2 The Dirichlet process
3 Nonparametric mixture models
Ferguson (1973) definition of the Dirichlet process
Definition
• P is a random probability measure over (Y,B(Y)).
• F is the whole space of probability measures on (Y,B(Y)), so P ∈ F.
• Let α ∈ R+ and P0 ∈ F .
• P ∼ DP(α,P0) iff for any n and any partition B1, . . . ,Bn of Y
(P(B1),P(B2), . . . ,P(Bn)) ∼ Dir(αP0(B1), αP0(B2), . . . , αP0(Bn))
The DP is a distribution of random probability distributions.
Interpretation
If P ∼ DP(α,P0), then for any measurable A
• E (P(A)) = P0(A)
• Var(P(A)) = P0(A){1− P0(A)}/(1 + α)
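Both moments can be checked by simulation: for a fixed set A, the finite-dimensional definition applied to the partition {A, Aᶜ} implies P(A) ∼ Beta(αP0(A), α(1 − P0(A))). A minimal NumPy sketch (the values of α and P0(A) are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, p0A = 5.0, 0.3   # concentration and base-measure mass P0(A)

# By the DP definition on the partition {A, A^c},
# P(A) ~ Beta(alpha * P0(A), alpha * (1 - P0(A))).
draws = rng.beta(alpha * p0A, alpha * (1.0 - p0A), size=200_000)

print(draws.mean())  # ~ P0(A) = 0.3
print(draws.var())   # ~ P0(A)(1 - P0(A))/(1 + alpha) = 0.035
```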
Density estimation using DP priors
If yi iid∼ P for i = 1, . . . , n, and a priori P ∼ DP(α, P0), then

P | y1, . . . , yn ∼ DP( α + n,  1/(α + n) ∑_{i=1}^n δyi + α/(α + n) P0 ).
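The posterior mean of the CDF is therefore a weighted average of the empirical CDF and the base CDF, with weights n and α. A small sketch, assuming P0 = N(0, 1) and simulated data (function names are my own):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
alpha = 1.0
y = rng.normal(1.0, sqrt(2.0), size=50)   # data from N(1, 2)

def std_normal_cdf(t):
    """CDF of the N(0, 1) base measure P0."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def posterior_mean_cdf(t, y, alpha):
    """Posterior expectation of P((-inf, t]): a weighted average of
    the empirical CDF and the base CDF, with weights n and alpha."""
    n = len(y)
    return (np.mean(y <= t) * n + alpha * std_normal_cdf(t)) / (n + alpha)

print(posterior_mean_cdf(1.0, y, alpha))
```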
Density estimation using DP priors
Figure: Black true density (N(1, 2)), blue base measure (N(0, 1)), green dashed ECDF, blue dashed posterior DP. First plot n = 10, second n = 50.
Stick-breaking
An alternative representation of the DP is given by the so-called stick-breaking process:
Stick-breaking representation of the DP
To obtain P ∼ DP(αP0):
• Draw a sequence of Beta random variables Vj iid∼ Beta(1, α);
• Define a sequence of weights πj = Vj ∏_{l<j} (1 − Vl);
• Draw independent atoms θj iid∼ P0;
• Define

P = ∑_{j=1}^∞ πj δθj.
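These steps translate directly into code once the infinite sum is truncated at a finite number of atoms (a standard practical approximation; the truncation level and the base measure N(0, 1) are my choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def stick_breaking(alpha, n_atoms, rng):
    """Truncated stick-breaking draw from DP(alpha, P0) with P0 = N(0, 1):
    weights pi_j = V_j * prod_{l<j} (1 - V_l), atoms theta_j ~ P0."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    pi = v * remaining              # length of stick broken off at step j
    theta = rng.normal(0.0, 1.0, size=n_atoms)
    return pi, theta

pi, theta = stick_breaking(alpha=2.0, n_atoms=500, rng=rng)
print(pi.sum())  # close to 1: the truncation discards negligible mass
```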
Stochastic processes and Chinese restaurants. . .
Imagine a Chinese restaurant with countably infinitely many tables, labelled 1, 2, . . . Customers walk in and sit down at some table, chosen according to the following random process.
1 The first customer sits at table 1;
2 The n-th customer chooses the first unoccupied table with probability α/(α + n − 1), and occupied table j with probability nj/(α + n − 1), where nj is the number of people already sitting at that table.
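The seating process above can be simulated in a few lines (plain Python; the function name is hypothetical):

```python
import random

def crp(n, alpha, seed=0):
    """Simulate table occupancy counts for n customers in the
    Chinese restaurant process with concentration alpha."""
    rng = random.Random(seed)
    tables = []                          # tables[j] = customers at table j
    for i in range(n):                   # customer i+1 arrives
        r = rng.uniform(0.0, alpha + i)  # total weight = alpha + i
        acc = 0.0
        chosen = len(tables)             # default: open a new table
        for j, nj in enumerate(tables):
            acc += nj                    # occupied table j has weight nj
            if r < acc:
                chosen = j
                break
        if chosen == len(tables):
            tables.append(1)
        else:
            tables[chosen] += 1
    return tables

counts = crp(100, alpha=1.0)
print(len(counts), "tables;", sum(counts), "customers")
```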
CRP or Polya urn construction of the DP
If θi iid∼ P0 and P ∼ DP(αP0), integrating out P gives

pr(θi | θ1, . . . , θi−1) = ∑_j nj/(α + i − 1) δθj + α/(α + i − 1) P0,

where the sum runs over the distinct values θj among θ1, . . . , θi−1 and nj is the number of previous draws equal to θj. It follows that (θ1, . . . , θn) ∼ PU(αP0), the Pólya urn scheme.
Considerations
• Draws from a DP are a.s. discrete;
• Unappealing if y is continuous; useful if y is discrete? (no, but wait for my afternoon talk)
Finite mixture models
Assume the following model:

yi ∼ N(μSi, σ²Si),  pr(Si = h) = πh,

with likelihood

f(y | μ, σ², π) = ∑_{j=1}^k πj φ(y; μj, σ²j),

and priors (μj, σ²j) ∼ P0, π ∼ Dir(α).
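The mixture likelihood above can be evaluated directly; a minimal sketch with made-up parameter values:

```python
import numpy as np

def normal_pdf(y, mu, sigma2):
    """Gaussian density phi(y; mu, sigma2)."""
    return np.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def fmm_density(y, pi, mu, sigma2):
    """Finite-mixture likelihood f(y) = sum_j pi_j * phi(y; mu_j, sigma2_j)."""
    return sum(p * normal_pdf(y, m, s2) for p, m, s2 in zip(pi, mu, sigma2))

# Hypothetical two-component example
pi = [0.4, 0.6]
mu = [-1.0, 2.0]
sigma2 = [1.0, 0.5]
print(fmm_density(0.0, pi, mu, sigma2))
```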
FMM applications: density estimation
• With enough components, a mixture of Gaussians can approximate any continuous distribution.
• If the number of components equals n, we recover kernel density estimation.
[Figure: mixture density estimate of geyser$duration; y-axis: probability density function.]
FMM applications: model-based clustering
• Divide observations into homogeneous clusters;
• "Homogeneous" depends on the kernel (Gaussian in the previous slide);
• With a Gaussian kernel, there are two clusters in the iris dataset (the truth is three!);
• See discussions in Petralia et al. (2012), Canale and Scarpa (2015), and Canale and De Blasi (2015).
[Figure: scatterplot of iris$Sepal.Width against iris$Sepal.Length.]
Infinite mixture models
• A more elegant way to write the finite mixture model is

f(y) = ∫ K(y; θ) dP(θ),  P = ∑_{j=1}^k ωj δθj,

where K(·; θ) is a general kernel (e.g. normal) parametrized by θ.
• Clearly, a prior on the weights and on the parameters of the kernel is equivalent to a prior on the finite discrete measure P.
• From FMM to IMM ⇒ P ∼ DP(αP0)!
DP mixture models
• The model and prior are

y ∼ f,  f(y) = ∫ K(y; θ) dP(θ),  P ∼ DP(αP0),

where K(·; θ) is a general kernel (e.g. normal) parametrized by θ.
• The DPM prior can be seen as a "smoothed version" of the DP prior (just as kernel density estimation is a smoothed version of the histogram).
• Widely used for continuous distributions.
Hierarchical representation
Using a hierarchical representation, the mixture model can be expressed as

yi | θi ∼ K(yi; θi)
θi ∼ P
P ∼ DP(αP0).
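Combining this hierarchical form with the Pólya urn scheme seen earlier gives a simple way to simulate from the DPM prior, with P integrated out (the base measure N(0, 1) and the kernel bandwidth below are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_dpm_prior(n, alpha, rng):
    """One draw of (y_1, ..., y_n) from the hierarchical DPM model,
    with P integrated out via the Polya urn. Hypothetical choices:
    P0 = N(0, 1) and kernel K(y; theta) = N(theta, 0.5^2)."""
    theta = []
    for i in range(n):
        if rng.uniform() < alpha / (alpha + i):
            theta.append(rng.normal(0.0, 1.0))      # fresh atom from P0
        else:
            theta.append(theta[rng.integers(i)])    # copy an earlier theta
    theta = np.array(theta)
    y = rng.normal(theta, 0.5)                      # y_i | theta_i ~ K
    return y, theta

y, theta = sample_dpm_prior(200, alpha=1.0, rng=rng)
print(len(np.unique(theta)), "distinct atoms among", len(theta), "draws")
```

Picking a past θ uniformly at random reproduces the urn weights: a value with multiplicity nj is chosen with probability nj/(α + i), matching the predictive rule.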
Mixture of Gaussians
• Gold standard for density estimation;
• can approximate any continuous distribution (Lo, 1984; Escobar and West, 1995);
• large support and good frequentist properties (Ghosal et al., 1999).
The model and the prior are

f(y) = ∫ N(y; μ, τ⁻¹) dP(μ, τ⁻¹),  P ∼ DP(αP0),

where N(y; μ, τ⁻¹) is a normal kernel with mean μ and precision τ, and P0 is a Normal-Gamma base measure, for conjugacy.
Mixture of Gaussians
yi | μi, τi ∼ N(μi, τi⁻¹)
(µi , τi ) ∼ P
P ∼ DP(αP0).
Complex data
• Mixture models can also be used when we have complex (modern) data.
• An example is functional data f1, . . . , fn:

fi(t) = ηi(t) + εit,

where ηi is a smooth function of t and the εit are random noise terms.
• We can model these data with

fi | ηi ∼ N(ηi, σ²)
ηi ∼ P
P ∼ DP(αP0).
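A quick sketch of how such functional data might be generated for experimentation (the mean curve, evaluation grid, and noise level are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 100)   # common evaluation grid

def simulate_curves(n, t, rng, sigma=0.1):
    """Simulate n noisy functional observations f_i(t) = eta(t) + eps_it
    around a hypothetical smooth mean curve eta(t) = sin(2*pi*t)."""
    eta = np.sin(2.0 * np.pi * t)
    return eta + rng.normal(0.0, sigma, size=(n, t.size))

curves = simulate_curves(5, t, rng)
print(curves.shape)  # (5, 100)
```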