bregman divergences 08-05-2008

Bregman Divergences Barnabás Póczos RLAI Tea Talk UofA, Edmonton Aug 5, 2008

Bregman Divergences

Barnabás Póczos

RLAI Tea Talk

UofA, Edmonton

Aug 5, 2008


Contents

Bregman Divergences
  Definition
  Properties
Bregman Matrix Divergences
Relation to Exponential Family
Applications
  Generalization of PCA to Exponential Family
  Generalized^2 Linear^2 Models
  Clustering / Co-clustering with Bregman Divergences
  Generalized Nonnegative Matrix Factorization

Conclusion


Bregman Divergences (Euclidean distance)

Squared Euclidean distance is a Bregman divergence (upcoming figures are borrowed from Dhillon)
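The formula on this slide did not survive extraction. For reference, the standard definition of the Bregman divergence generated by a strictly convex, differentiable function φ, together with its Euclidean special case, can be written as follows (the symbols φ, x, y are the usual textbook notation and are an assumption about what the slide showed):

```latex
D_\varphi(x, y) = \varphi(x) - \varphi(y) - \langle \nabla\varphi(y),\, x - y \rangle

\varphi(x) = \tfrac{1}{2}\|x\|^2
\;\Longrightarrow\;
D_\varphi(x, y) = \tfrac{1}{2}\|x\|^2 - \tfrac{1}{2}\|y\|^2 - \langle y,\, x - y\rangle
               = \tfrac{1}{2}\|x - y\|^2
```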


Bregman Divergences (KL-divergence)

Generalized Relative Entropy (also called generalized KL-divergence) is another Bregman divergence
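The slide's formula is missing from the transcript. A worked version under the usual assumption that the generating function is the negative entropy φ(x) = Σ x_i log x_i on the positive orthant:

```latex
\varphi(x) = \sum_i x_i \log x_i, \qquad \nabla\varphi(y)_i = \log y_i + 1

D_\varphi(x, y)
  = \sum_i x_i \log x_i - \sum_i y_i \log y_i - \sum_i (\log y_i + 1)(x_i - y_i)
  = \sum_i \Big( x_i \log\frac{x_i}{y_i} - x_i + y_i \Big)
```

When x and y both sum to one, the last two terms cancel and the ordinary KL-divergence remains.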


Bregman Divergences (Itakura-Saito)

The Itakura-Saito distance is another Bregman divergence (used in signal processing; its generating function is the Burg entropy)
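The formula itself is not in the transcript. In the scalar case, assuming the Burg entropy φ(x) = -log x as generating function, the derivation is short:

```latex
\varphi(x) = -\log x, \qquad \varphi'(y) = -\frac{1}{y}

D_\varphi(x, y) = -\log x + \log y + \frac{1}{y}(x - y)
               = \frac{x}{y} - \log\frac{x}{y} - 1
```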


Examples of Bregman Divergences
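The table on this slide was lost. As a minimal sketch, the three examples from the previous slides can all be computed from the single generic definition D_φ(x, y) = φ(x) - φ(y) - ⟨∇φ(y), x - y⟩. The function and dictionary names below are illustrative, not taken from the talk:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Generic Bregman divergence D_phi(x, y) for vectors x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(phi(x) - phi(y) - np.dot(grad_phi(y), x - y))

# Each entry is (phi, gradient of phi); inputs must be positive for the
# last two generators because of the logarithm.
GENERATORS = {
    "squared_euclidean": (lambda v: 0.5 * np.dot(v, v), lambda v: v),
    "generalized_kl": (lambda v: np.sum(v * np.log(v)), lambda v: np.log(v) + 1.0),
    "itakura_saito": (lambda v: -np.sum(np.log(v)), lambda v: -1.0 / v),
}

def divergence(name, x, y):
    """Look up a generator by name and evaluate the divergence."""
    phi, grad_phi = GENERATORS[name]
    return bregman(phi, grad_phi, x, y)
```

Calling `divergence("generalized_kl", x, y)` with positive vectors gives the same value as the closed form Σ x_i log(x_i / y_i) - x_i + y_i.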


Properties of the Bregman Divergences

Euclidean special case:

[Figure: triangle with points x, y, z, side lengths a, b, c, and angle γ]
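The formulas for this slide are missing. One standard property that is consistent with the figure labels is the three-point identity, whose Euclidean special case is the law of cosines. Which point carries which label in the figure is an assumption here:

```latex
D_\varphi(x, y) = D_\varphi(x, z) + D_\varphi(z, y)
                  - \langle x - z,\ \nabla\varphi(y) - \nabla\varphi(z) \rangle

% Euclidean case, \varphi = \tfrac12\|\cdot\|^2, with
% a = \|x - z\|,\ b = \|y - z\|,\ c = \|x - y\|,\ \gamma = \angle(x - z,\ y - z):
c^2 = a^2 + b^2 - 2ab\cos\gamma
```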


Properties of the Bregman Divergences

Nearness in Bregman divergence: the Bregman projection of y onto a convex set Ω.

Generalized Pythagoras theorem:

When Ω is an affine set, the Pythagoras theorem holds with equality.

Opposite to triangle inequality:
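The formulas on this slide were lost. In the usual formulation, the Bregman projection is P_Ω(y) = argmin over ω in Ω of D_φ(ω, y), and the generalized Pythagoras theorem reads D_φ(x, y) ≥ D_φ(x, P_Ω(y)) + D_φ(P_Ω(y), y) for every x in Ω. The sketch below checks the equality case numerically for one concrete choice that is an assumption of this example: the generalized KL-divergence and the affine constraint Σ ω_i = 1, for which the projection of a positive vector is simply its normalization.

```python
import numpy as np

def gen_kl(x, y):
    """Generalized KL-divergence between positive vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(x * np.log(x / y) - x + y))

def project_sum_to_one(y):
    """Bregman (generalized KL) projection of y onto {w : sum(w) = 1}."""
    y = np.asarray(y, dtype=float)
    return y / y.sum()

y = np.array([0.5, 2.0, 1.5, 3.0])   # point to project
x = np.array([0.1, 0.2, 0.3, 0.4])   # any point already in the set
p = project_sum_to_one(y)

lhs = gen_kl(x, y)
rhs = gen_kl(x, p) + gen_kl(p, y)
print(lhs, rhs)  # the two values agree up to rounding
```

For a convex set that is not affine, the same comparison would give lhs ≥ rhs instead of equality.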


(Regular) Exponential Families
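The definition on this slide is missing from the transcript. The common textbook form, with natural parameter θ, cumulant (log-partition) function ψ and base measure p_0, is given here as an assumption about the slide's notation:

```latex
p_{\psi,\theta}(x) = \exp\big( \langle x, \theta \rangle - \psi(\theta) \big)\, p_0(x),
\qquad
\mu = \mathbb{E}[X] = \nabla\psi(\theta)
```

The family is called regular when the set of θ with finite ψ(θ) is open.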


Gaussian Distribution

Note: Gaussian distribution ⇔ squared loss from the expected value µ
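A one-line check of this correspondence, assuming unit variance and the generating function φ(x) = x²/2 (both are choices made for this illustration):

```latex
p(x; \mu) = \frac{1}{\sqrt{2\pi}} \exp\Big( -\tfrac{1}{2}(x - \mu)^2 \Big)
          = \exp\big( -D_\varphi(x, \mu) \big) \cdot \frac{1}{\sqrt{2\pi}},
\qquad \varphi(x) = \tfrac{1}{2}x^2
```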


Poisson Distribution

The Poisson distribution:

The Poisson distribution is a member of the exponential family. Its expected value is µ = λ.

Is there a divergence associated with the Poisson distribution?

Yes! p(x) can be rewritten as

Implication: Poisson distribution ⇔ relative entropy
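The rewritten form of p(x) is missing from the transcript. A form that makes the link explicit is p(x) = exp(-D(x, µ)) · b(x), where D is the generalized relative entropy D(x, µ) = x log(x/µ) - x + µ and b(x) = x^x e^(-x) / x!. The sketch below checks this identity numerically; the function names are illustrative, and the factorization is the standard one rather than text recovered from the slide.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability mass function lam^x e^{-lam} / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def gen_relative_entropy(x, mu):
    """Scalar generalized relative entropy, with the convention 0 log 0 = 0."""
    log_term = x * math.log(x / mu) if x > 0 else 0.0
    return log_term - x + mu

def base_measure(x):
    """b(x) = x^x e^{-x} / x!, which does not depend on the parameter."""
    return (x ** x) * math.exp(-x) / math.factorial(x)

def poisson_via_divergence(x, mu):
    """Poisson pmf written as exp(-divergence) times the base measure."""
    return math.exp(-gen_relative_entropy(x, mu)) * base_measure(x)
```

Since b(x) does not depend on µ, maximizing the Poisson likelihood in µ is the same as minimizing the relative entropy D(x, µ).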


Exponential Distribution

The Exponential distribution:

The Exponential distribution is a member of the exponential family. Its expected value is µ = 1/λ.

Is there a divergence associated with the Exponential distribution?

Yes! p(x) can be rewritten as

Implication: Exponential distribution ⇔ Itakura-Saito distance
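The formulas for this slide are missing. A worked version, assuming the density p(x) = λ e^(-λx) for x > 0 and the scalar Itakura-Saito distance D_IS(x, µ) = x/µ - log(x/µ) - 1:

```latex
\exp\big( -D_{IS}(x, \mu) \big)
  = \exp\Big( -\frac{x}{\mu} + \log\frac{x}{\mu} + 1 \Big)
  = \frac{e\,x}{\mu}\, e^{-x/\mu}

p(x) = \lambda e^{-\lambda x} = \frac{1}{\mu} e^{-x/\mu}
     = \exp\big( -D_{IS}(x, \mu) \big) \cdot \frac{1}{e\,x},
\qquad \mu = \frac{1}{\lambda}
```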


Fenchel Conjugate

Definition: The Fenchel conjugate of a function f is defined as:

Properties of the Fenchel conjugate:
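The formulas on this slide were lost. The standard definition and the properties most relevant to Bregman divergences are listed here as an assumption about what the slide contained:

```latex
f^*(y) = \sup_{x} \big\{ \langle x, y \rangle - f(x) \big\}

% Properties:
% 1. f^* is convex, even if f is not.
% 2. f^{**} = f whenever f is closed and convex.
% 3. If f is strictly convex and differentiable, then \nabla f^* = (\nabla f)^{-1}.

% Example: f(x) = \tfrac12\|x\|^2 \;\Rightarrow\; f^*(y) = \tfrac12\|y\|^2
```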


Bregman Divergences and the Exponential Family

Bijection Theorem
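The statement of the theorem is missing from the transcript. In the form given by Banerjee et al. (JMLR 2005), which is presumably what the slide showed, every regular exponential family with cumulant function ψ corresponds to exactly one regular Bregman divergence, generated by the Fenchel conjugate φ = ψ*:

```latex
p_{\psi,\theta}(x) = \exp\big( \langle x, \theta \rangle - \psi(\theta) \big)\, p_0(x)
                   = \exp\big( -D_\varphi(x, \mu) \big)\, b_\varphi(x),
\qquad
\varphi = \psi^*, \quad \mu = \nabla\psi(\theta)
```

The Gaussian, Poisson and Exponential slides are instances of this correspondence.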


Bregman Matrix Divergences

An immediate solution would be the componentwise sum of Bregman divergences.

However, we can get more interesting divergences using the general definition.


Bregman Divergences of Hermitian matrices

A complex square matrix A is Hermitian if A = A*.

The eigenvalues of a Hermitian matrix are real.

Let
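The definition that followed "Let" is missing. The general matrix form used by Dhillon and Tropp, given here as an assumption about the slide's content, replaces the vector inner product by a trace inner product on Hermitian matrices:

```latex
D_\varphi(X, Y) = \varphi(X) - \varphi(Y) - \langle \nabla\varphi(Y),\ X - Y \rangle,
\qquad
\langle X, Y \rangle = \operatorname{Re}\,\operatorname{tr}(X Y^*)

% Example: \varphi(X) = \tfrac12\|X\|_F^2 \;\Rightarrow\; D_\varphi(X, Y) = \tfrac12\|X - Y\|_F^2
```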


Burg Matrix Divergence (Logdet divergence)
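The formula is missing from the transcript. For n × n positive definite matrices, assuming the generating function φ(X) = -log det X, the derivation is:

```latex
\varphi(X) = -\log\det X, \qquad \nabla\varphi(Y) = -Y^{-1}

D_{Burg}(X, Y) = -\log\det X + \log\det Y + \operatorname{tr}\big( Y^{-1}(X - Y) \big)
              = \operatorname{tr}(X Y^{-1}) - \log\det(X Y^{-1}) - n
```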


Von Neumann Matrix Divergence
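The formula is missing from the transcript. With the generating function φ(X) = tr(X log X - X), the divergence is D(X, Y) = tr(X log X - X log Y - X + Y). The sketch below evaluates it for real symmetric positive definite matrices, computing the matrix logarithm through an eigendecomposition. Restricting to real matrices and the helper names are choices made for this example.

```python
import numpy as np

def sym_logm(a):
    """Matrix logarithm of a symmetric positive definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(a)
    # Scale each eigenvector column by the log of its eigenvalue.
    return (eigvecs * np.log(eigvals)) @ eigvecs.T

def von_neumann_divergence(x, y):
    """tr(X log X - X log Y - X + Y) for symmetric positive definite X, Y."""
    return float(np.trace(x @ sym_logm(x) - x @ sym_logm(y) - x + y))

# Two random symmetric positive definite test matrices.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4))
b = rng.standard_normal((4, 4))
X = a @ a.T + 4.0 * np.eye(4)
Y = b @ b.T + 4.0 * np.eye(4)

d_xy = von_neumann_divergence(X, Y)
d_xx = von_neumann_divergence(X, X)
```

As with any Bregman divergence, `d_xy` is nonnegative and `d_xx` vanishes up to rounding error.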


Applications, Matrix inequalities

Hadamard inequality:

Proof:

Another inequality:

Proof:

What is more, here we can arbitrarily permute the eigenvalues!
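The formulas and proofs on this slide were lost. For the Hadamard inequality, a short argument based on the nonnegativity of the Burg matrix divergence goes as follows. It is offered as a plausible reconstruction, not as the slide's own proof. Here X is n × n positive definite with diagonal entries x_ii:

```latex
0 \le D_{Burg}\big( X, \operatorname{diag}(X) \big)
  = \operatorname{tr}\big( X \operatorname{diag}(X)^{-1} \big)
    - \log\det\big( X \operatorname{diag}(X)^{-1} \big) - n
  = n - \log\det X + \sum_{i} \log x_{ii} - n

\Longrightarrow\quad \det X \le \prod_{i=1}^{n} x_{ii}
```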


Applications of Bregman divergences

Clustering: Partition the columns of a data matrix, so that “similar” columns are in the same partition (Banerjee et al., JMLR, 2005)

Co-clustering: Simultaneously partition both the rows and columns of a data matrix (Banerjee et al., JMLR, 2007)

Low-Rank Matrix Approximation
  Exponential Family PCA (Collins et al., NIPS 2001)
  Non-negative matrix factorization (Dhillon & Sra, NIPS 2005)
  Generalized^2 Linear^2 Models (Gordon, NIPS 2002)

POMDP (Gordon, NIPS 2002)

Online learning (Warmuth, COLT 2000)

Metric Nearness Problem: Given a matrix of “distances”, find the “nearest” matrix of distances such that all distances satisfy the triangle inequality (Dhillon et al., 2004)


Generalized^2 Linear^2 Models, (GL)^2M

Goal:

GLM Special cases:

PCA, SVD
Exp-family PCA
Infomax ICA
Linear regression
Nonnegative matrix factorization


What is a good loss function?

Euclidean metric as a loss function: predicting 1010 instead of 1000 is just as bad as predicting 3 instead of -7…

Sigmoid regression: exponentially many local minima in the dimension

The “log loss” function is convex in θ! We say f(z) and the log loss match each other.


Searching for matching loss
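The derivation on these slides is missing from the transcript. One standard construction of a matching loss, assuming a strictly increasing transfer function f with antiderivative F (so F is convex and F' = f), is:

```latex
L_F(y \mid z) = F(z) + F^*(y) - y z = D_{F^*}\big( y,\ f(z) \big),
\qquad
\frac{\partial}{\partial z} L_F(y \mid z) = f(z) - y
```

Because F is convex, the loss is convex in z, and therefore convex in θ whenever z = θ·x is linear in the parameters.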



Special cases

Thus,

Log loss, entropic loss

Other special cases:


Logistic regression
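The content of this slide was lost. As a small illustration of matching, the sketch below assumes the sigmoid transfer function f(z) = 1/(1 + e^(-z)), whose antiderivative is F(z) = log(1 + e^z). For labels y in {0, 1} the conjugate term F*(y) vanishes, so the matching loss F(z) - y z should coincide with the familiar log loss. Function names are illustrative.

```python
import numpy as np

def softplus(z):
    """F(z) = log(1 + exp(z)), computed in a numerically stable way."""
    return np.logaddexp(0.0, z)

def sigmoid(z):
    """Transfer function f(z) = F'(z)."""
    return 1.0 / (1.0 + np.exp(-z))

def matching_loss(y, z):
    """F(z) - y z + F*(y), where F*(y) = 0 for y in {0, 1}."""
    return softplus(z) - y * z

def log_loss(y, z):
    """Negative Bernoulli log-likelihood of label y under p = sigmoid(z)."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

z_grid = np.linspace(-5.0, 5.0, 41)
same_for_y0 = np.allclose(matching_loss(0.0, z_grid), log_loss(0.0, z_grid))
same_for_y1 = np.allclose(matching_loss(1.0, z_grid), log_loss(1.0, z_grid))
```

Both comparisons evaluate to `True`, and the derivative of `matching_loss` with respect to z is `sigmoid(z) - y`, which is the usual logistic regression gradient.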


(GL)^2M algorithm

GLM cost:

GLM goal:

(GL)^2M goal:

(GL)^2M cost:

The (GL)^2M algorithm, fixed-point equations:


Robot Navigation

A corridor in the CMU CS building with initial belief: (it doesn’t know which end)

Belief space = R^550

550 states (275 positions × 2 orientations)

Robot can:
- sense both side walls
- compute an accurate estimate of its lateral position

Robot cannot:
- resolve its position along the corridor, unless it is near an observable feature
- tell whether it is pointing left or right


Robot Navigation

The belief space is large, but sparse and compressible.

The belief vectors lie on a nonlinear manifold.

This method can be used for planning, too.

They factored a matrix of 400 beliefs using feature space ranks l = 3, 4, 5.

f(z) = exp(z), H* = 10^-12 ||V||^2, G* = 10^-12 ||U||^2 + ∆(U)

A belief vector using belief tracker algorithm

Reconstructions using l=3,4,5 ranks

With PCA, they need 85 dimensions to match the (GL)^2M rank-5 decomposition and 25 dimensions for the rank-3 decomposition


Nonnegative matrix factorization

Goal:

Cost functions:

Algorithms:
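The goal, cost functions and algorithms on this slide were lost. As one concrete instance, the sketch below uses the generalized KL-divergence D(V, WH) as the cost and the classical multiplicative updates of Lee and Seung, which keep W and H nonnegative. This is a stand-in example and not necessarily the algorithm shown in the talk. The name `nmf_kl` and the small `eps` guard are assumptions of this example.

```python
import numpy as np

def gen_kl(v, wh):
    """Generalized KL-divergence between positive matrices V and WH."""
    return float(np.sum(v * np.log(v / wh) - v + wh))

def nmf_kl(v, rank, n_iter=200, seed=0, eps=1e-12):
    """Factor V ~ W H with W, H >= 0 using multiplicative KL updates."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = v.shape
    w = rng.random((n_rows, rank)) + 0.1
    h = rng.random((rank, n_cols)) + 0.1
    history = []
    for _ in range(n_iter):
        wh = w @ h + eps
        # H update: H <- H * (W^T (V / WH)) / (column sums of W)
        h *= (w.T @ (v / wh)) / (w.sum(axis=0)[:, None] + eps)
        wh = w @ h + eps
        # W update: W <- W * ((V / WH) H^T) / (row sums of H)
        w *= ((v / wh) @ h.T) / (h.sum(axis=1)[None, :] + eps)
        history.append(gen_kl(v, w @ h + eps))
    return w, h, history

# Small strictly positive test matrix.
rng = np.random.default_rng(1)
V = rng.random((6, 5)) + 0.1
W, H, cost_history = nmf_kl(V, rank=2)
```

The updates only multiply by nonnegative ratios, so no explicit projection onto the nonnegative orthant is needed.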


Nonnegative matrix factorization, results

CBCL face image database

P. Hoyer, sparse NMF algorithm.

[Figure panels: with sparse constraints; without constraints]


Exponential Family PCA

PCA 1

PCA 2

Cost function

Special case
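The cost function and special case are missing from the transcript. In the formulation of Collins et al. (NIPS 2001), given here as an assumption about the slide's content, the natural parameters form a low-rank matrix Θ = AV and the cost is the negative log-likelihood, where G is the cumulant function of the chosen family:

```latex
\Theta = A V, \qquad
L(A, V) = \sum_{i,j} \big( G(\theta_{ij}) - x_{ij}\,\theta_{ij} \big) + \text{const}

% Gaussian special case, G(\theta) = \tfrac12\theta^2:
L(A, V) = \tfrac{1}{2}\,\| X - A V \|_F^2 + \text{const}
```

The Gaussian case is ordinary PCA, which is why the method is a generalization of it.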


Exponential family PCA, Results


Clustering with Bregman Divergences
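The content of this slide was lost. A minimal sketch of hard clustering in the style of Banerjee et al. (JMLR 2005) is given below. It looks like k-means, except that points are assigned by a Bregman divergence. The centroid update remains the arithmetic mean, because the mean minimizes the total divergence Σ D_φ(x_i, µ) for every Bregman divergence. The generalized KL-divergence, the function names and the toy data are assumptions of this example.

```python
import numpy as np

def gen_kl_rows(x, mu):
    """Generalized KL from each row of x to each row of mu, shape (n, k)."""
    x3 = x[:, None, :]
    m3 = mu[None, :, :]
    return np.sum(x3 * np.log(x3 / m3) - x3 + m3, axis=2)

def bregman_hard_clustering(x, k, divergence=gen_kl_rows, n_iter=50, seed=0):
    """Alternate divergence-based assignment and mean-based centroid update."""
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.zeros(len(x), dtype=int)
    for it in range(n_iter):
        new_labels = divergence(x, mu).argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                mu[j] = members.mean(axis=0)  # the mean is optimal for any Bregman divergence
    return labels, mu

# Toy data: two groups of strictly positive points.
rng = np.random.default_rng(2)
data = np.vstack([rng.random((20, 3)) + 0.5, rng.random((20, 3)) + 5.0])
labels, centroids = bregman_hard_clustering(data, k=2)
```

Swapping `gen_kl_rows` for a squared Euclidean distance function recovers ordinary k-means.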


The Original Problem of Bregman


Conclusion

Introduced the Bregman divergence
Relationship to the exponential family
Generalization to matrices
Applications:
  Matrix inequalities
  Exponential family PCA
  NMF
  GLM
  Clustering / Biclustering
  Online learning
Bregman divergences suggest new algorithms
Lots of existing algorithms turn out to be special cases
A matching loss function can help to decrease the number of local minima


References

Matrix Nearness Problems with Bregman Divergences. I. S. Dhillon and J. A. Tropp. SIAM Journal on Matrix Analysis and Applications, vol. 29, no. 4, pages 1120-1146, November 2007.

A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximations. A. Banerjee, I. S. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. Journal of Machine Learning Research (JMLR), vol. 8, pages 1919-1986, August 2007.

Clustering with Bregman Divergences. A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Journal of Machine Learning Research (JMLR), vol. 6, pages 1705-1749, October 2005.

A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximations. A. Banerjee, I. S. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 509-514, August 2004.

Clustering with Bregman Divergences. A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Proceedings of the Fourth SIAM International Conference on Data Mining, pages 234-245, April 2004.

Nonnegative Matrix Approximation: Algorithms and Applications. S. Sra and I. S. Dhillon. UTCS Technical Report #TR-06-27, June 2006.

Generalized Nonnegative Matrix Approximations with Bregman Divergences. I. S. Dhillon and S. Sra. NIPS, pages 283-290, Vancouver, Canada, December 2005. (Also appears as UTCS Technical Report #TR-05-31, June 1, 2005.)


PPT slides

Irina Rish: Bregman Divergences in Clustering and Dimensionality Reduction

Manfred K. Warmuth: COLT 2000

Inderjit S. Dhillon:
  Machine Learning with Bregman Divergences
  Low-Rank Kernel Learning with Bregman Matrix Divergences
  Matrix Nearness Problems Using Bregman Divergences
  Information Theoretic Clustering, Co-clustering and Matrix Approximations