A Convex Exemplar-based Approach to MAD-Bayes Dirichlet Process Mixture Models
TRANSCRIPT
MAD-Bayes | Convex Exemplar-based Approach | Optimality Guarantees | ADMM for Structural-Regularized Programs | Experiments and Results
A Convex Exemplar-based Approach to
MAD-Bayes Dirichlet Process Mixture Models
Presenter: Jimmy Xin Lin
Joint work with Ian E.H. Yen, Kai Zhong, Pradeep Ravikumar, and Inderjit S. Dhillon
The University of Texas at Austin
June 9, 2015
I. En-Hsu Yen, X. Lin, K. Zhong, P. Ravikumar, I. Dhillon, The University of Texas at Austin
Table of Contents
1 MAD-Bayes: MAP-based Asymptotic Derivation from Bayes
2 Convex Exemplar-based Approach
3 Optimality Guarantees
4 ADMM for Structural-Regularized Programs
5 Experiments and Results
MAD-Bayes: Dirichlet Process Mixture
MAD-Bayes derives the asymptotic log-likelihood of the DP mixture when the variance $\sigma_X^2$ of the distribution $P(x_i \,|\, \mu_{z_i})$ approaches 0:

$$\min_{z_i \in [K],\, \mu_k \in \mathbb{R}^p,\, K} \;\sum_{i=1}^{N} D(x_i, \mu_{z_i}) + \lambda K,$$

where $D(x_i, \mu_k)$ is the Bregman divergence (corresponding to $P(x \,|\, z, \mu)$) between sample $x_i$ and the $k$-th mean parameter $\mu_k$, and $K$ is the number of clusters.
Inference (by MCMC, etc.) is replaced by a MAP optimization program w.r.t. $z_i$, $\mu_k$, and $K$.
MAD-Bayes: Dirichlet Process Mixture

The MAD-Bayes objective of the DP mixture yields a combinatorial problem

$$\min_{z_i \in [K],\, \mu_k \in \mathbb{R}^p,\, K} \;\sum_{i=1}^{N} D(x_i, \mu_{z_i}) + \lambda K.$$

Kulis and Jordan [1] solve the above via an algorithm similar to K-means, where $K$ is incremented whenever doing so decreases the objective. The approach extends to the Hierarchical DP (HDP) mixture:

$$\min_{z_i \in [K_g],\, \mu_k,\, K_g,\, K_d} \;\sum_{d=1}^{D} \sum_{i \in d} D(x_i, \mu_{z_i}) + \theta \sum_{d=1}^{D} K_d + \lambda K_g$$
[1] Kulis, Brian and Jordan, Michael I. Revisiting k-means: New algorithms via Bayesian nonparametrics. ICML, 2012.
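As an illustration, the DP-means-style procedure of [1] can be sketched as follows. This is a minimal sketch assuming squared Euclidean distance as the Bregman divergence; all function and variable names are ours, not from the paper.

```python
import numpy as np

def dp_means(X, lam, n_iter=20):
    """DP-means sketch: assign each point to its nearest mean, but open a
    new cluster whenever the best distance exceeds lam (so paying the
    penalty lam for one more cluster lowers the objective)."""
    mus = [X[0].copy()]                      # start with a single cluster
    for _ in range(n_iter):
        z = np.empty(len(X), dtype=int)
        for i, x in enumerate(X):
            d = np.array([np.sum((x - mu) ** 2) for mu in mus])
            k = int(np.argmin(d))
            if d[k] > lam:                   # opening a new cluster is cheaper
                mus.append(x.copy())
                k = len(mus) - 1
            z[i] = k
        # re-estimate the mean of each non-empty cluster
        mus = [X[z == k].mean(axis=0) for k in range(len(mus)) if np.any(z == k)]
    return z, np.array(mus)
```

On well-separated data with a moderate λ, the procedure recovers the natural clusters without K being fixed in advance.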
Convex Exemplar-based Approach
Observation:
$$\min_{z_i \in [K],\, \mu_k \in \mathbb{R}^p,\, K} \;\sum_{i=1}^{N} D(x_i, \mu_{z_i}) + \lambda K$$

is equivalent to

$$\min_{W \in \{0,1\}^{N \times J}} \;\langle D, W \rangle + \lambda \|W\|_{\infty,1} \quad \text{s.t.} \quad W\mathbf{1} = \mathbf{1},$$

where $D_{ij} = D(x_i, \mu_j)$, assuming there is an exemplar set $E = \{\mu_j\}_{j=1}^{J}$ of possible mean parameters. The program becomes convex when we relax $w_{i,j}$ from $\{0,1\}$ to $[0,1]$. This concept of exemplar was also employed in the K-medoid problem [1].
[1] Nellore, Abhinav and Ward, Rachel. Recovery guarantees for exemplar-based clustering. arXiv:1309.3256, 2013.
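The relaxed objective is straightforward to evaluate directly. A minimal sketch, assuming squared Euclidean distance as the divergence and taking every sample as a candidate exemplar ($J = N$); the names are ours:

```python
import numpy as np

def exemplar_objective(X, W, lam):
    """Evaluate <D, W> + lam * ||W||_{inf,1} for the exemplar-based
    formulation, with every sample as a candidate exemplar."""
    # D[i, j] = squared Euclidean divergence between x_i and candidate mu_j
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    assert np.allclose(W.sum(axis=1), 1.0)   # the constraint W 1 = 1
    inner = np.sum(D * W)                    # <D, W>
    # ||W||_{inf,1} = sum_j max_i w_ij; for integer W this counts the
    # exemplars in use, i.e. the number of clusters K
    inf1 = np.max(W, axis=0).sum()
    return inner + lam * inf1
```

For an integer $W$, the regularizer term reduces to $\lambda K$, matching the combinatorial objective above.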
Convex Exemplar-based Approach
For HDP:

$$\sum_{d=1}^{D} \sum_{i \in d} D(x_i, \mu_{z_i}) + \theta \sum_{d=1}^{D} K_d + \lambda K_g$$

is equivalent to

$$\min_{W} \;\langle D, W \rangle + \theta \|W\|_{G} + \lambda \|W\|_{\infty,1} \quad \text{s.t.} \quad W\mathbf{1} = \mathbf{1},$$

where $D_{ij} = D(x_i, \mu_j)$, assuming there is an exemplar set $E = \{\mu_j\}_{j=1}^{J}$ of possible mean parameters. The program becomes convex when relaxing $w_{i,j}$ from $\{0,1\}$ to $[0,1]$.
Optimality Guarantee
Suppose the convex relaxation

$$\min_{W \in [0,1]^{N \times J}} \;\langle D, W \rangle + \lambda \|W\|_{\infty,1} \quad \text{s.t.} \quad W\mathbf{1} = \mathbf{1} \qquad (1)$$

has an integer solution; then that solution is also optimal for the original combinatorial problem. The following theorem shows that a clustering satisfying a separation condition leads to an integer solution of the convex program (1) for a range of $\lambda$.

Theorem. Suppose there exists a clustering $\{S_k\}_{k \in M}$ for which we can find $\lambda$ such that

$$\max_{k \in M} \;\max_{i,j \in S_k} N_k d_{ij} \;<\; \lambda \;<\; \min_{k,l \in M,\, k \neq l} \;\min_{i \in S_k,\, j \in S_l} N_k d_{ij} \qquad (2)$$

where $N_k = |S_k|$ and $d_{ij} = D(x_i, x_j) - D(x_i, x_{M(i)})$; then the integer solution $W^*$ realizing $\{S_k\}_{k \in M}$ is the unique optimal solution to (1).
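Condition (2) can be checked numerically for a candidate clustering. A sketch, assuming the adjusted divergences $d_{ij}$ from the theorem are supplied as a matrix; the names are ours:

```python
import numpy as np

def lambda_range(d, clusters):
    """Return (lo, hi) such that condition (2) holds for lambda in (lo, hi):
    lo = max_k max_{i,j in S_k} N_k d_ij,
    hi = min_{k != l} min_{i in S_k, j in S_l} N_k d_ij.
    `d` is the matrix of adjusted divergences d_ij; `clusters` lists the
    index sets S_k. A non-empty interval certifies an integer optimum."""
    lo = max(len(S) * d[np.ix_(S, S)].max() for S in clusters)
    hi = min(len(Sk) * d[np.ix_(Sk, Sl)].min()
             for a, Sk in enumerate(clusters)
             for b, Sl in enumerate(clusters) if a != b)
    return lo, hi
```

The wider the gap between within-cluster and between-cluster divergences, the wider the resulting interval of admissible $\lambda$, mirroring the discussion on the next slide.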
Optimality Guarantee
According to the theorem, the greater the separation of a clustering, the wider the range of $\lambda$ producing that clustering.

(i) $K(\lambda) = \|W^*(\lambda)\|_{\infty,1}$ monotonically decreases with $\lambda$.
(ii) $W^*(\lambda)$ and $K(\lambda)$ are in one-to-one correspondence.
[Figure: number of clusters K versus lambda (log scale, $10^0$ to $10^2$) on the Iris and Glass datasets, marking fractional samples, integer samples, and the integer range.]
Optimality Guarantee
A similar guarantee can be obtained for the HDP formulation, where clusters that do not share a data set can satisfy a weaker separation requirement.

(i) $K_g$ and $K_l$ monotonically decrease with $\lambda$ and $\theta$, respectively.
(ii) $W^*(\lambda, \theta)$ and $(K_g, K_l)$ are in one-to-one correspondence.
Solving Structural-Regularized Programs
We employ an ADMM procedure that converges linearly to the optimum of the convex program

$$\min_{W} \;\langle D, W \rangle + \theta \|W\|_{G} + \lambda \|W\|_{\infty,1} \quad \text{s.t.} \quad W\mathbf{1} = \mathbf{1} \qquad (3)$$

by introducing dual variables $Y_1, Y_2$ and a consensus variable $Z$ such that (3) decomposes into a sequence of (i) a sub-problem with only the simplex constraint

$$W_1^{(t+1)} = \arg\min_{W \in [0,1]^{N \times J}} \;\langle D, W \rangle + \langle Y_1^{(t)}, W \rangle + \frac{\rho}{2}\|W - Z^{(t)}\|^2 \quad \text{s.t.} \quad W\mathbf{1} = \mathbf{1}$$

and (ii) a sub-problem with only the structural regularizers

$$W_2^{(t+1)} = \arg\min_{W \in [0,1]^{N \times J}} \;\theta \|W\|_{G} + \lambda \|W\|_{\infty,1} + \langle Y_2^{(t)}, W \rangle + \frac{\rho}{2}\|W - Z^{(t)}\|^2.$$
Solving Structural-Regularized Program
The first sub-problem,

$$W_1^{(t+1)} = \arg\min_{W \in [0,1]^{N \times J}} \;\langle D, W \rangle + \langle Y_1^{(t)}, W \rangle + \frac{\rho}{2}\|W - Z^{(t)}\|^2 \quad \text{s.t.} \quad W\mathbf{1} = \mathbf{1},$$

can be solved via simplex projection.
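Simplex projection admits a standard sorting-based algorithm; a minimal sketch (applied per row of $W$ in this setting), with the names ours:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of vector v onto the probability simplex
    {w : w >= 0, sum(w) = 1}, via the classic sorting-based algorithm."""
    u = np.sort(v)[::-1]                     # sort in decreasing order
    css = np.cumsum(u) - 1.0
    # largest index rho with u_rho - css_rho / (rho + 1) > 0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)           # the shift that enforces sum = 1
    return np.maximum(v - theta, 0.0)
```

The box constraint $W \in [0,1]^{N \times J}$ is implied here, since each row already sums to 1 with non-negative entries after projection.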
The second sub-problem,

$$W_2^{(t+1)} = \arg\min_{W \in [0,1]^{N \times J}} \;\theta \|W\|_{G} + \lambda \|W\|_{\infty,1} + \langle Y_2^{(t)}, W \rangle + \frac{\rho}{2}\|W - Z^{(t)}\|^2,$$

has a closed-form solution via the proximal mapping $\mathrm{prox}_{\lambda}(\mathrm{prox}_{\theta}(\cdot))$.
ADMM update:

$$Z^{(t+1)} \leftarrow \left(W_1^{(t+1)} + W_2^{(t+1)}\right)/2$$
$$Y_q^{(t+1)} \leftarrow Y_q^{(t)} + \alpha\left(W_q^{(t+1)} - Z^{(t+1)}\right), \quad q = 1, 2.$$
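The updates above fit a generic consensus-ADMM skeleton. This is a sketch with the two sub-problem solvers passed in as functions, since their exact form depends on $D$, $\theta$, and $\lambda$; all names are ours:

```python
import numpy as np

def consensus_admm(step1, step2, Z0, alpha=1.0, n_iter=100):
    """Consensus-ADMM skeleton matching the updates on the slide.
    step1(Y, Z) and step2(Y, Z) return the minimizers W1^{(t+1)} and
    W2^{(t+1)} of the two sub-problems (simplex step and proximal step)."""
    Z = Z0
    Y1 = np.zeros_like(Z0)
    Y2 = np.zeros_like(Z0)
    for _ in range(n_iter):
        W1 = step1(Y1, Z)                   # simplex-constrained sub-problem
        W2 = step2(Y2, Z)                   # structural-regularizer sub-problem
        Z = (W1 + W2) / 2.0                 # consensus update
        Y1 = Y1 + alpha * (W1 - Z)          # dual ascent, q = 1
        Y2 = Y2 + alpha * (W2 - Z)          # dual ascent, q = 2
    return Z
```

With any pair of sub-problem solvers of the form above (each including the $\frac{\rho}{2}\|W - Z^{(t)}\|^2$ term), the iterates reach consensus on a common $Z$.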
Results
The MAD-Bayes objective values achieved by different approaches.
[1] Kulis, Brian and Jordan, Michael I. Revisiting k-means: New algorithms via Bayesian nonparametrics. ICML, 2012.
Future Work
The convex optimization approach applies to other exemplar-based unsupervised learning problems, such as Multiple Sequence Alignment and Sequence Motif Finding, which can be seen as exemplar-based versions of the Hidden Markov Model.
For DP mixture models, the current algorithm runs in O(N^2) time. A greedy Column Generation method that reduces the complexity from O(N^2) to O(NK) is desired.
Question Time