Bregman Divergences (transcript, Aug 5, 2008)
Barnabás Póczos
RLAI Tea Talk, UofA, Edmonton
Contents
Contents
- Bregman Divergences: definition, properties
- Bregman Matrix Divergences
- Relation to the Exponential Family
- Applications:
  - Generalization of PCA to the exponential family
  - Generalized² Linear² Models
  - Clustering / co-clustering with Bregman divergences
  - Generalized nonnegative matrix factorization
- Conclusion
Bregman Divergences (Euclidean distance)
Squared Euclidean distance is a Bregman divergence (upcoming figures are borrowed from Dhillon).
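The definition itself did not survive the transcript; a minimal sketch of it, with the squared Euclidean special case checked numerically (the function names are mine, not the slides'):

```python
# Hedged sketch of the definition behind these slides: the Bregman divergence
# of a strictly convex, differentiable phi is
#   D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>.
import numpy as np

def bregman(phi, grad_phi, x, y):
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(x) = ||x||^2 induces the squared Euclidean distance
phi = lambda v: np.dot(v, v)
grad_phi = lambda v: 2.0 * v

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
d = bregman(phi, grad_phi, x, y)  # equals ||x - y||^2 = 13.0
```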
Bregman Divergences (KL-divergence)
Generalized Relative Entropy (also called generalized KL-divergence) is another Bregman divergence
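The generating function is not shown in the transcript; a hedged numeric check that the negative-entropy generator reproduces the generalized relative entropy (names are mine):

```python
# Hedged sketch: phi(x) = sum x_i log x_i - x_i (negative entropy) induces
# the generalized relative entropy  D(x, y) = sum x_i log(x_i/y_i) - x_i + y_i.
import numpy as np

def gen_kl(x, y):
    return np.sum(x * np.log(x / y) - x + y)

def bregman_negentropy(x, y):
    phi = lambda v: np.sum(v * np.log(v) - v)
    grad = lambda v: np.log(v)
    return phi(x) - phi(y) - np.dot(grad(y), x - y)

x = np.array([0.5, 1.5, 2.0])
y = np.array([1.0, 1.0, 1.0])
a = gen_kl(x, y)
b = bregman_negentropy(x, y)  # the two agree
```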
Bregman Divergences (Itakura-Saito)
The Itakura-Saito distance is another Bregman divergence (used in signal processing; generated by the Burg entropy).
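The formula is again missing from the transcript; a hedged check that the Burg entropy generator yields the Itakura-Saito distance (function names are mine):

```python
# Hedged sketch: phi(x) = -sum log x_i (the Burg entropy) induces the
# Itakura-Saito distance  D(x, y) = sum [ x_i/y_i - log(x_i/y_i) - 1 ].
import numpy as np

def itakura_saito(x, y):
    return np.sum(x / y - np.log(x / y) - 1.0)

def bregman_burg(x, y):
    phi = lambda v: -np.sum(np.log(v))
    grad = lambda v: -1.0 / v
    return phi(x) - phi(y) - np.dot(grad(y), x - y)

x = np.array([0.5, 2.0, 3.0])
y = np.array([1.0, 1.0, 2.0])
a = itakura_saito(x, y)
b = bregman_burg(x, y)  # the two agree
```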
Examples of Bregman Divergences
Properties of the Bregman Divergences
Euclidean special case:
[Figure: triangle with vertices x, y, z, side lengths a, b, c, and angle γ.]
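The formula behind the figure is missing; a hedged reconstruction of the three-point property and its Euclidean special case:

```latex
% Hedged reconstruction of the three-point property behind the figure:
D_\varphi(x, z) = D_\varphi(x, y) + D_\varphi(y, z)
                  - \langle x - y,\ \nabla\varphi(z) - \nabla\varphi(y)\rangle
% Euclidean special case (\varphi = \|\cdot\|^2): the law of cosines
% c^2 = a^2 + b^2 - 2ab\cos\gamma for the triangle x, y, z with angle \gamma at y.
```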
Properties of the Bregman Divergences
Nearness in Bregman divergence: the Bregman projection of y onto a convex set Ω.
Generalized Pythagorean theorem:
When Ω is an affine set, the Pythagorean theorem holds with equality.
Opposite of the triangle inequality:
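The projection and Pythagoras formulas did not survive the transcript; a hedged reconstruction:

```latex
% Hedged reconstruction of the slide's formulas.
% Bregman projection of y onto a convex set \Omega:
P_\Omega(y) = \operatorname*{arg\,min}_{\omega \in \Omega} D_\varphi(\omega, y)
% Generalized Pythagorean inequality, for all x \in \Omega:
D_\varphi(x, y) \;\ge\; D_\varphi\big(x, P_\Omega(y)\big) + D_\varphi\big(P_\Omega(y), y\big)
```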
(Regular) Exponential Families
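The slide's definition is missing from the transcript; a hedged reconstruction of the standard form:

```latex
% Hedged reconstruction of the regular exponential family definition:
p_\psi(x \mid \theta) = \exp\big(\langle x, \theta \rangle - \psi(\theta)\big)\, p_0(x),
% with natural parameter \theta and log-partition (cumulant) function
% \psi(\theta) = \log \int \exp(\langle x, \theta \rangle)\, p_0(x)\, dx .
```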
Gaussian Distribution
Note: Gaussian distribution ↔ squared loss from the expected value µ
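The density is missing from the transcript; a hedged reconstruction of the correspondence:

```latex
% Hedged reconstruction: with fixed \sigma^2,
p(x \mid \mu) = (2\pi\sigma^2)^{-d/2} \exp\!\Big(-\tfrac{\|x - \mu\|^2}{2\sigma^2}\Big),
% so -\log p(x \mid \mu) is the squared loss \|x - \mu\|^2/(2\sigma^2) up to a constant.
```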
Poisson Distribution
The Poisson distribution: p(x) = λ^x e^{−λ} / x!
The Poisson distribution is a member of the exponential family. Its expected value is µ = λ.
Is there a divergence associated with the Poisson distribution?
Yes! p(x) can be rewritten in Bregman form.
Implication: Poisson distribution ↔ relative entropy
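The rewriting the slide alludes to is missing; a hedged reconstruction:

```latex
% Hedged reconstruction of the rewriting:
p(x) = \frac{\lambda^x e^{-\lambda}}{x!}
     = \exp\!\Big(-\underbrace{\big(x \log\tfrac{x}{\lambda} - x + \lambda\big)}_{d_\varphi(x,\lambda)}\Big)\,
       \frac{x^x e^{-x}}{x!},
% where d_\varphi is the generalized relative entropy (\varphi(t) = t\log t - t).
```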
Exponential Distribution
The Exponential distribution: p(x) = λ e^{−λx}, x ≥ 0
The Exponential distribution is a member of the exponential family. Its expected value is µ = 1/λ.
Is there a divergence associated with the Exponential distribution?
Yes! p(x) can be rewritten in Bregman form.
Implication: Exponential distribution ↔ Itakura-Saito distance
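Again the rewriting is missing from the transcript; a hedged reconstruction:

```latex
% Hedged reconstruction: with \mu = 1/\lambda,
p(x) = \lambda e^{-\lambda x}
     = \exp\!\Big(-\underbrace{\big(\tfrac{x}{\mu} - \log\tfrac{x}{\mu} - 1\big)}_{d_\varphi(x,\mu)}\Big)\,
       \frac{1}{e\,x},
% where d_\varphi is the Itakura-Saito distance (\varphi(t) = -\log t).
```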
Fenchel Conjugate
Definition: The Fenchel conjugate of a function f is defined as f*(y) = sup_x [⟨x, y⟩ − f(x)].
Properties of the Fenchel conjugate: f* is always convex; f** = f when f is convex and lower semicontinuous; for differentiable, strictly convex f, ∇f* = (∇f)⁻¹.
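A small numeric illustration (my addition, not from the slides): approximating the conjugate by brute force for f(x) = x²/2, which is known to be its own conjugate.

```python
# Hypothetical numeric check: approximate f*(y) = sup_x [x*y - f(x)] on a
# grid, for f(x) = x^2/2, whose closed-form conjugate is f*(y) = y^2/2.
import numpy as np

def fenchel_conjugate(f, y, grid):
    # brute-force supremum over the grid
    return np.max(grid * y - f(grid))

f = lambda x: 0.5 * x ** 2
grid = np.linspace(-10.0, 10.0, 100001)
approx = fenchel_conjugate(f, 3.0, grid)  # close to 3.0**2 / 2 = 4.5
```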
Bregman Divergences and the Exponential Family: Bijection Theorem
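The statement is missing from the transcript; a hedged reconstruction of the bijection (Banerjee et al., JMLR 2005, cited in the references):

```latex
% Hedged statement of the bijection theorem:
p_\psi(x \mid \theta) = \exp\big(\langle x, \theta\rangle - \psi(\theta)\big)\, p_0(x)
                      = \exp\big(-d_\varphi(x, \mu(\theta))\big)\, b_\varphi(x),
% where \varphi = \psi^* (the Fenchel conjugate) and \mu(\theta) = \nabla\psi(\theta).
```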
Bregman Matrix Divergences
An immediate solution would be the componentwise sum of Bregman divergences. However, we can get more interesting divergences using the general definition.
Bregman Divergences of Hermitian Matrices
A complex square matrix A is Hermitian if A = A*. The eigenvalues of a Hermitian matrix are real.
Let φ be a convex function defined on Hermitian matrices through their eigenvalues.
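The matrix-level definition is missing from the transcript; a hedged reconstruction:

```latex
% Hedged reconstruction: for Hermitian X, Y and a spectral function \varphi,
D_\varphi(X, Y) = \varphi(X) - \varphi(Y)
                  - \operatorname{tr}\!\big(\nabla\varphi(Y)^{*}\,(X - Y)\big)
```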
Burg Matrix Divergence (Logdet divergence)
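The slide's formula is missing; a hedged sketch of the Burg (LogDet) divergence, with a numeric sanity check (function names are mine):

```python
# Hedged sketch: the Burg (LogDet) matrix divergence, from phi(X) = -log det X:
#   D(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n.
import numpy as np

def logdet_div(X, Y):
    n = X.shape[0]
    M = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - n

X = np.array([[2.0, 0.3], [0.3, 1.0]])
Y = np.eye(2)
# D(X, X) == 0 and D(X, Y) >= 0, as for any Bregman divergence
```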
Von Neumann Matrix Divergence
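Likewise missing; a hedged sketch of the von Neumann divergence via the eigendecomposition (function names are mine):

```python
# Hedged sketch: the von Neumann divergence, from phi(X) = tr(X log X - X):
#   D(X, Y) = tr(X log X - X log Y - X + Y).
import numpy as np

def matrix_log(A):
    # matrix logarithm of a Hermitian positive definite matrix
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.log(w)) @ V.conj().T

def von_neumann_div(X, Y):
    return np.trace(X @ matrix_log(X) - X @ matrix_log(Y) - X + Y)

X = np.diag([1.0, 2.0])
Y = np.diag([2.0, 2.0])
# for diagonal matrices D reduces to elementwise x log(x/y) - x + y
```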
Applications, Matrix inequalities
Hadamard inequality:
Proof:
Another inequality:
Proof:
What is more, here we can arbitrarily permute the eigenvalues!
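The proofs' formulas are missing from the transcript; a hedged reconstruction of the Hadamard step via nonnegativity of the Burg matrix divergence:

```latex
% For positive definite A (hedged reconstruction):
D_{\mathrm{Burg}}\big(A, \operatorname{diag}(A)\big)
  = \operatorname{tr}\!\big(A\,\operatorname{diag}(A)^{-1}\big)
    - \log\det\!\big(A\,\operatorname{diag}(A)^{-1}\big) - n
  = \log\frac{\prod_i A_{ii}}{\det A} \;\ge\; 0,
% i.e. Hadamard's inequality  \det A \le \prod_i A_{ii}.
```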
Applications of Bregman divergences
- Clustering: partition the columns of a data matrix so that "similar" columns are in the same partition (Banerjee et al., JMLR, 2005)
- Co-clustering: simultaneously partition both the rows and columns of a data matrix (Banerjee et al., JMLR, 2007)
- Low-rank matrix approximation: exponential family PCA (Collins et al., NIPS 2001), POMDPs (Gordon, NIPS 2002), nonnegative matrix factorization (Dhillon & Sra, NIPS 2005)
- Generalized² Linear² Models (Gordon, NIPS 2002)
- Online learning (Warmuth, COLT 2000)
- Metric nearness problem: given a matrix of "distances", find the "nearest" matrix of distances such that all distances satisfy the triangle inequality (Dhillon et al., 2004)
Generalized² Linear² Models, (GL)²M
Goal:
GLM special cases: PCA, SVD; exp-family PCA; infomax ICA; linear regression; nonnegative matrix factorization
What is a good loss function?
- The Euclidean metric as a loss function: predicting 1010 instead of 1000 is just as bad as predicting 3 instead of −7…
- Sigmoid regression: exponentially many local minima in the dimension.
- The "log loss" function is convex in θ! We say f(z) and the log loss match each other.
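A hedged numeric illustration of the matching-loss point (my addition, not from the slides): with a sigmoid transfer, the squared loss is non-convex in the argument while the matching log loss is convex.

```python
# Hedged illustration: with sigmoid transfer f(z) = 1/(1+e^{-z}), the squared
# loss (y - f(z))^2 is not convex in z, but the matching "log loss" is.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
y = 1.0
z = np.linspace(-6.0, 6.0, 1201)

sq_loss = (y - sigmoid(z)) ** 2
log_loss = -y * np.log(sigmoid(z)) - (1.0 - y) * np.log(1.0 - sigmoid(z))

def second_diff(v):
    # discrete second derivative; negative values indicate non-convexity
    return v[2:] - 2.0 * v[1:-1] + v[:-2]

# second differences of log_loss stay >= 0; those of sq_loss dip below 0
```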
Searching for matching loss
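The slide's formula is missing from the transcript; a hedged reconstruction of the matching-loss construction (cf. Warmuth, COLT 2000):

```latex
% Hedged reconstruction: the matching loss of a transfer function f is
L\big(y, \hat y\big) = \int_{f^{-1}(y)}^{f^{-1}(\hat y)} \big(f(z) - y\big)\, dz
                     = D_\varphi\big(f^{-1}(\hat y),\, f^{-1}(y)\big), \qquad \varphi' = f.
% For f(z) = z this gives the squared loss (\hat y - y)^2/2; for the sigmoid, the log loss.
```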
Special cases
Thus,
Log loss, entropic loss
Other special cases:
Logistic regression
(GL)²M Algorithm
GLM cost:
GLM goal:
(GL)²M goal:
(GL)²M cost:
The (GL)²M algorithm, fixed-point equations:
Robot Navigation
A corridor in the CMU CS building with initial belief (it doesn't know which end).
Belief space = R^550
550 states (275 positions × 2 orientations).
The robot can: sense both side walls; compute an accurate estimate of its lateral position.
The robot cannot: resolve its position along the corridor, unless it is near an observable feature; tell whether it is pointing left or right.
Robot Navigation
The belief space is large, but sparse and compressible. The belief vectors lie on a nonlinear manifold. This method can be used for planning, too. They factored a matrix of 400 beliefs using feature-space ranks l = 3, 4, 5.
f(z) = exp(z), H* = 10^{−12}||V||², G* = 10^{−12}||U||² + ∆(U)
[Figures: a belief vector from the belief-tracker algorithm, and reconstructions using ranks l = 3, 4, 5.]
With PCA, they need 85 dimensions to match the (GL)²M rank-5 decomposition and 25 dimensions for the rank-3 decomposition.
Nonnegative matrix factorization
Goal: given a nonnegative matrix V, find nonnegative factors W and H such that V ≈ WH.
Cost functions:
Algorithms:
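The algorithms themselves are missing from the transcript; a hedged sketch of the classical Lee & Seung-style multiplicative updates for the Frobenius cost (the generalized Bregman NMF algorithms are in Dhillon & Sra, NIPS 2005):

```python
# Hedged sketch, not the slides' algorithm verbatim: multiplicative updates
# for the Frobenius cost ||V - WH||^2; updates preserve nonnegativity.
import numpy as np

def nmf(V, r, iters=200, eps=1e-9, seed=0):
    """Factor nonnegative V (m x n) as W (m x r) times H (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update H
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update W
    return W, H

V = np.random.default_rng(1).random((6, 5))
W, H = nmf(V, r=3)
err = np.linalg.norm(V - W @ H)  # reconstruction error of the rank-3 fit
```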
Nonnegative matrix factorization, results
CBCL face image databaseP. Hoyer, sparse NMF algorithm.
With sparse constraints Without constraints
Exponential Family PCA
PCA 1:
PCA 2:
Cost function:
Special case:
Exponential family PCA, Results
Clustering with Bregman Divergences
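The slide's algorithm is missing from the transcript; a hedged sketch of Bregman hard clustering (Banerjee et al., 2005): a k-means loop where points are assigned by an arbitrary Bregman divergence, and the key fact that each cluster's optimal representative is its plain mean. All names and the toy data are mine.

```python
# Hedged sketch of Bregman hard clustering: assignment by Bregman divergence,
# centroid update by the plain mean (optimal for every Bregman divergence).
import numpy as np

def bregman_kmeans(X, k, d, init_idx, iters=20):
    centers = X[init_idx].astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.array([int(np.argmin([d(x, c) for c in centers])) for x in X])
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# e.g. with the generalized relative entropy as the divergence (positive data)
gen_kl = lambda x, c: np.sum(x * np.log(x / c) - x + c)
X = np.vstack([np.full((5, 2), 1.0), np.full((5, 2), 10.0)])
labels, centers = bregman_kmeans(X, k=2, d=gen_kl, init_idx=[0, 9])
# the two blocks of points end up in different clusters
```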
The Original Problem of Bregman: finding a common point of convex sets by successive (Bregman) projections (Bregman, 1967).
Conclusion
- Introduced the Bregman divergence
- Relationship to the exponential family
- Generalization to matrices
- Applications: matrix inequalities, exponential family PCA, NMF, GLM, clustering / biclustering, online learning
- Bregman divergences suggest new algorithms
- Lots of existing algorithms turn out to be special cases
- A matching loss function can help to decrease the number of local minima
References
- Matrix Nearness Problems with Bregman Divergences. I. S. Dhillon and J. A. Tropp. SIAM Journal on Matrix Analysis and Applications, vol. 29, no. 4, pages 1120-1146, November 2007.
- A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximations. A. Banerjee, I. S. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. Journal of Machine Learning Research (JMLR), vol. 8, pages 1919-1986, August 2007.
- Clustering with Bregman Divergences. A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Journal of Machine Learning Research (JMLR), vol. 6, pages 1705-1749, October 2005.
- A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximations. A. Banerjee, I. S. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 509-514, August 2004.
- Clustering with Bregman Divergences. A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Proceedings of the Fourth SIAM International Conference on Data Mining, pages 234-245, April 2004.
- Nonnegative Matrix Approximation: Algorithms and Applications. S. Sra and I. S. Dhillon. UTCS Technical Report #TR-06-27, June 2006.
- Generalized Nonnegative Matrix Approximations with Bregman Divergences. I. S. Dhillon and S. Sra. NIPS, pages 283-290, Vancouver, Canada, December 2005. (Also appears as UTCS Technical Report #TR-05-31, June 1, 2005.)
PPT Slides
- Irina Rish: Bregman Divergences in Clustering and Dimensionality Reduction
- Manfred K. Warmuth: COLT 2000
- Inderjit S. Dhillon: Machine Learning with Bregman Divergences; Low-Rank Kernel Learning with Bregman Matrix Divergences; Matrix Nearness Problems Using Bregman Divergences; Information Theoretic Clustering, Co-clustering and Matrix Approximations