Chinese Restaurant Process
TRANSCRIPT
Mohitdeep Singh
ML, April 29th, 2015
Outline
• General Introduction
• Chinese Restaurant Process
• Build up to non-parametric Bayesians
• Chinese Restaurant Franchise (if time)
• Demo
Motivation
• Where do you start?
• How do you start?
• Unsupervised learning techniques?
The ML story
• Machine Learning = algorithms/data structures + reasoning under uncertainty (statistics)
• Typical pipeline: choose a decision rule / probability distribution / tree, then fit it via MLE/MAP estimation
• When direct estimation is intractable, turn to EM-style techniques
• Within statistics: Bayesian vs Frequentist
Bayesian vs Frequentist
                 Frequentist                               Bayesian
Parametric       Logistic Regression, Fisher               Graphical Models, …
                 Discriminant Analysis, …
Non-Parametric   KNN, kernel approaches, decision trees    Gaussian Process, Dirichlet Process, …
Clustering
• Fundamental problem in machine learning
• Where are the clusters?
• How many clusters (the parameter k)?
LDA
w: word, represented as a multinomial random variable
z: topic allocation, represented as a multinomial random variable
Θ: document model, a Dirichlet random variable
α & β: hyper-parameters
http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Non-Parametric Bayesian
• Fundamental equation:
posterior ∝ prior × likelihood
• If θ is a finite-dimensional (Euclidean) parameter:
p(θ|x) ∝ p(θ) p(x|θ)
• Instead, introduce a stochastic process G as the random object:
p(G|x) ∝ p(G) p(x|G)
Chinese Restaurant Process
• A random process in which the task is analogous to seating customers in a Chinese restaurant with an infinite number of tables.
• The first person sits at the first table (deterministically).
• The nth person sits at a table according to the following process:
  P(join existing table i) ∝ nᵢ (the number of customers already seated there)
  P(join an empty table) ∝ α₀
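The seating rule above can be sketched as a short simulation. This is an illustrative sampler (the function name, parameters, and seed are my own, not from the talk):

```python
import random

def crp_sample(n, alpha, seed=0):
    """Sample a partition of n customers from a CRP with concentration alpha.

    Returns one table index per customer. Customer i joins existing table k
    with probability n_k / (i + alpha) and a new table with probability
    alpha / (i + alpha), matching the rule on the slide.
    """
    rng = random.Random(seed)
    table_counts = []   # table_counts[k] = customers currently at table k
    assignment = []
    for i in range(n):
        # Draw a point uniformly on [0, i + alpha); the first i units of
        # mass belong to occupied tables, the final alpha to a new table.
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, n_k in enumerate(table_counts):
            acc += n_k
            if r < acc:
                table_counts[k] += 1
                assignment.append(k)
                break
        else:
            # New table (always taken by the very first customer).
            table_counts.append(1)
            assignment.append(len(table_counts) - 1)
    return assignment

assignment = crp_sample(100, alpha=1.0)
print("tables used:", max(assignment) + 1)
```

Larger α₀ produces more occupied tables; the expected number grows like α₀ log n.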
CRP and clustering
[Figure: three occupied tables with parameters ϕ1, ϕ2, ϕ3]
• Data points are customers; tables are clusters.
• Prior: the first person to sit at table k chooses a parameter vector ϕk for that table (P(G)).
• Likelihood: associate each data point with the parameter of its table (P(x|G)).
• Posterior: turn the Bayesian crank, P(G|x).
Exchangeability
• As a prior on partitions of the data, the CRP is an exchangeable process.
• The concept was introduced by Haag and popularized by de Finetti.
• A sequence is exchangeable if its joint probability function is a symmetric function of its n arguments.
Polya urn model (more later)
θn | θ1, …, θn−1 ~ (α₀G₀ + Σᵢ₌₁ⁿ⁻¹ δθᵢ) / (α₀ + n − 1)
Polya urn model
Consider an urn with g green balls and r red balls. Draw a ball at random and note its color. Fix a number a, then return the ball along with a additional balls of the same color.
Let Xi = 1 if i-th draw yield green ball else 0.
p(1,1,0,1) = [g/(g+r)] · [(g+a)/(g+r+a)] · [r/(g+r+2a)] · [(g+2a)/(g+r+3a)]
           = p(0,1,1,1)
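The equality above can be checked mechanically: every reordering of the draws yields the same probability. The sketch below (function name and the g=2, r=3, a=1 example values are my own) computes the exact product formula with rational arithmetic:

```python
from fractions import Fraction
from itertools import permutations

def seq_prob(seq, g, r, a):
    """Exact probability of a 0/1 draw sequence from a Polya urn that starts
    with g green (=1) and r red (=0) balls, adding a balls of the drawn
    color after each draw."""
    green, red = Fraction(g), Fraction(r)
    p = Fraction(1)
    for x in seq:
        total = green + red
        if x == 1:
            p *= green / total
            green += a          # reinforce green
        else:
            p *= red / total
            red += a            # reinforce red
    return p

# Exchangeability: all orderings of (1, 1, 0, 1) have the same probability.
probs = {seq_prob(s, g=2, r=3, a=1) for s in permutations((1, 1, 0, 1))}
print(probs)   # a single value
```

With g=2, r=3, a=1 the common value is (2·3·3·4)/(5·6·7·8) = 3/70, the formula on the slide.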
“The” de Finetti Theorem
Theorem (de Finetti): If x1, x2, … are exchangeable, then the joint probability distribution has the form
p(x1, …, xn) = ∫ ∏ᵢ₌₁ⁿ p(xᵢ|G) dP(G)
for some random measure G. In simple words, any exchangeable sequence of r.v.s can be represented as a mixture of i.i.d. r.v.s.
Finite Mixture Models
http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps
Stick breaking process
• Draw an infinite sequence of Beta random variables:
  βk ~ Beta(1, α₀)
• Define an infinite sequence of mixing proportions:
  - π1 = β1
  - πk = βk ∏ₗ₌₁ᵏ⁻¹ (1 − βl)
[Figure: a unit-length stick broken into pieces π1 = β1, π2 = β2(1−β1), π3 = β3(1−β2)(1−β1), …]
G = Σₖ₌₁^∞ πk δϕk, with atoms ϕk ~ G₀
G is called a Dirichlet process: for any finite partition (A1, …, Ar) of the sample space, the random vector (G(A1), …, G(Ar)) is distributed as a finite-dimensional Dirichlet distribution.
G ~ DP(α₀, G₀)
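The stick-breaking recursion is easy to simulate by truncating to finitely many atoms. A minimal sketch (function name and truncation level are my own choices):

```python
import random

def stick_breaking(alpha, n_atoms, seed=0):
    """Draw the first n_atoms stick-breaking weights pi_k of a DP(alpha, G0).

    beta_k ~ Beta(1, alpha); pi_k = beta_k * prod_{l<k} (1 - beta_l).
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0          # length of the stick not yet broken off
    for _ in range(n_atoms):
        beta = rng.betavariate(1.0, alpha)
        weights.append(beta * remaining)
        remaining *= 1.0 - beta
    return weights

pi = stick_breaking(alpha=2.0, n_atoms=50)
print(sum(pi))   # close to 1; the remainder belongs to the infinite tail
```

Smaller α₀ makes β ~ Beta(1, α₀) larger on average, so the first few weights dominate and G concentrates on fewer atoms.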
Dirichlet Process Mixture Model
http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps
Marginalize DPMM to get CRP
Hierarchical Dirichlet Process
• Multiple groups of data
• Share some common properties
• Cluster shared across multiple groups
Naïve attempt
MLE estimation??
Hierarchical Bayesian Approach
[Figure: groups tied together by a shared parameter θ]
Dirichlet Process Admixture model
• Admixture model: for each document, repeatedly draw the mixing proportions from the prior.
• A DP prior yields a disjoint set of atoms for each document.
• Disjoint atom sets ⇒ no sharing of clusters across documents.
• No sharing ⇒ no common Chinese restaurant.
Admixture Model
Hierarchical Dirichlet Process
• The issue: G₀ is a continuous measure.
• What if G₀ were discrete and random? But how?
Introduce another DP on Go.
G₀ | γ, H ~ DP(γ, H)
Gj | α₀, G₀ ~ DP(α₀, G₀)
• We just got more Bayesian.
Hierarchical Dirichlet Mixture Models
http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps
Chinese Restaurant Franchise
[Figure: a franchise of restaurants sharing dishes from a global menu]
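The franchise metaphor can be sketched as a two-level simulation: each restaurant seats its customers by its own CRP, and each new table orders a dish from the franchise-wide menu, itself a CRP over dishes. This is an illustrative sampler assuming the standard two-level construction (all names and parameters are my own):

```python
import random

def crf_sample(group_sizes, alpha, gamma, seed=0):
    """Chinese Restaurant Franchise: one restaurant per group.

    Customers choose tables via a local CRP(alpha); each new table draws
    its dish from a global CRP(gamma) over dishes, so dishes (cluster
    parameters) are shared across groups. Returns each customer's dish.
    """
    rng = random.Random(seed)
    dish_counts = []              # m_k: tables franchise-wide serving dish k
    all_dishes = []
    for n_j in group_sizes:
        table_counts = []         # customers per table in this restaurant
        table_dish = []           # dish served at each table
        dishes = []
        for i in range(n_j):
            r = rng.uniform(0, i + alpha)
            acc = 0.0
            for t, n_t in enumerate(table_counts):
                acc += n_t
                if r < acc:       # join an occupied table, inherit its dish
                    table_counts[t] += 1
                    dishes.append(table_dish[t])
                    break
            else:
                # New table: pick its dish proportional to m_k,
                # or a brand-new dish with mass gamma.
                r2 = rng.uniform(0, sum(dish_counts) + gamma)
                acc2 = 0.0
                for k, m_k in enumerate(dish_counts):
                    acc2 += m_k
                    if r2 < acc2:
                        dish_counts[k] += 1
                        dish = k
                        break
                else:
                    dish_counts.append(1)
                    dish = len(dish_counts) - 1
                table_counts.append(1)
                table_dish.append(dish)
                dishes.append(dish)
        all_dishes.append(dishes)
    return all_dishes

groups = crf_sample([30, 30], alpha=1.0, gamma=1.0)
print("dishes per group:", [sorted(set(g)) for g in groups])
```

Because both restaurants draw from the same global menu, the same dish index can appear in both groups, which is exactly the cluster sharing the per-document DPs could not provide.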
Recap
• Introduction
• Exchangeability and the de Finetti theorem
• Dirichlet Process
• Hierarchical Dirichlet Process
Other metaphors
• Nested Chinese Restaurant Process
• Beta Process (Indian Buffet Process)
• Hierarchical Beta Process (The Dependents Diner Process)
• Non-Parametric Regression (Gaussian Process)
• Inference Techniques (MCMC, Variational techniques)
DEMO
Feature engineering is Machine Learning
• Thanks to big data, the trend is to store everything without giving it much thought a priori.
• Big-data frameworks (like Presto) aid in data exploration.
• Let the models do the heavy lifting.
• Let the data reveal the underlying structure, i.e. minimize the assumptions.
• Deep learning is another example, where rich (although black-box) models are used for feature learning.
Questions
References:
1) http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps
2) http://mmds.imm.dtu.dk/presentations/whyeteh.pdf
3) Bayesian Nonparametrics tutorial, MLSS 2013, Tübingen
4) Machine Learning: A Probabilistic Perspective, Kevin Murphy
…and many more excellent tutorials available on the internet.