TRANSCRIPT
Topic 1
Clustering Basics
CS898
Overview
Basics (K-means)
• variance clustering
• generalizations (parametric & non-parametric)
Kernel K-means
Probabilistic K-means
• entropy clustering
Normalized Cut
Density biases
Spectral methods, bound optimization
In the beginning there was…
Basic K-means:

    E(S, m) = \sum_{k=1}^{K} \sum_{p \in S_k} \| f_p - m_k \|^2        (squared L2 norm)

input: features f_p
output: K subsets S_1, …, S_K of the data points (the clustering S)
extra parameters: the K means m_1, …, m_K
In this talk: K-means refers mostly to this or related objectives (not to the iterative Lloyd's algorithm, 1957).
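For concreteness, here is a minimal numpy sketch of this objective and of the block-coordinate (Lloyd-style) updates that locally minimize it; the function names and the initialization scheme are illustrative choices, not from the slides.

```python
import numpy as np

def kmeans_energy(F, labels, means):
    """Basic K-means objective: sum of squared L2 distances to the cluster means."""
    return sum(np.sum((F[labels == k] - means[k])**2) for k in range(len(means)))

def lloyd_kmeans(F, K, iters=100, seed=0):
    """Block-coordinate descent on the objective above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    means = F[rng.choice(len(F), K, replace=False)]            # init means from data points
    for _ in range(iters):
        d = ((F[:, None, :] - means[None, :, :])**2).sum(-1)   # |Omega| x K squared distances
        labels = d.argmin(1)                                    # assign each point to nearest mean
        means = np.stack([F[labels == k].mean(0) if np.any(labels == k) else means[k]
                          for k in range(K)])                   # re-estimate means
    return labels, means
```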
Basic K-means examples:
• RGB features → color quantization
• RGBXY features → superpixels
• XY features only → Voronoi cells
Compared to RGB only, adding XY gives spatial "compactness" (quasi-regularization).
Apply K-means to RGBXY features.
Basic K-means examples:
Superpixels
[SLIC superpixels, Achanta et al., PAMI 2011]
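A rough sketch of this idea (SLIC-style superpixels via K-means on RGBXY features); the spatial_weight parameter and the coordinate scaling are illustrative choices, and lloyd_kmeans refers to the sketch shown earlier.

```python
import numpy as np

def rgbxy_features(image, spatial_weight=0.5):
    """Stack RGB with scaled XY coordinates: clustering these 5-D features
    yields spatially compact, color-coherent segments (superpixels)."""
    H, W, _ = image.shape
    y, x = np.mgrid[0:H, 0:W]
    xy = np.stack([x, y], axis=-1).astype(float)
    xy *= spatial_weight * 255.0 / max(H, W)      # put XY on a scale comparable to RGB
    return np.concatenate([image.astype(float), xy], axis=-1).reshape(-1, 5)

# usage sketch (assumes lloyd_kmeans from the earlier snippet):
# F = rgbxy_features(img, spatial_weight=0.5)
# labels, _ = lloyd_kmeans(F, K=200)
# superpixels = labels.reshape(img.shape[:2])
```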
K-means as non-parametric clustering

Pairwise (non-parametric) form of the objective, with no parameters μ_k:

    E(S) = \sum_{k=1}^{K} \frac{\sum_{p,q \in S_k} \| f_p - f_q \|^2}{2\,|S_k|}

equivalent (easy to check) to the parametric form

    E(S) = \sum_{k=1}^{K} \sum_{p \in S_k} \| f_p - \mu_k \|^2     with     \mu_k = \frac{1}{|S_k|} \sum_{q \in S_k} f_q

These are the two standard formulas for sample variance — just plug in μ_k.
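A quick numerical sanity check of this equivalence on random data (illustrative, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.normal(size=(50, 3))                  # 50 random 3-D features
labels = rng.integers(0, 3, size=50)          # arbitrary assignment to K=3 clusters

mean_form, pair_form = 0.0, 0.0
for k in range(3):
    Fk = F[labels == k]
    mu = Fk.mean(0)
    mean_form += np.sum((Fk - mu)**2)                       # sum_p ||f_p - mu_k||^2
    D = ((Fk[:, None, :] - Fk[None, :, :])**2).sum(-1)      # pairwise squared distances in S_k
    pair_form += D.sum() / (2 * len(Fk))                    # sum_{p,q} ||f_p - f_q||^2 / (2|S_k|)

print(np.isclose(mean_form, pair_form))   # True: the two formulas agree
```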
K-means as variance clustering criteria

Both objectives above can be written as

    E_K(S) = \sum_{k=1}^{K} |S_k| \cdot \mathrm{var}(S_k)

=> K-means is good for "compact blobs"
K-means – common extensions

• Parametric methods with arbitrary distortion measure ||·||_d (distortion clustering):

    E(S, \theta) = \sum_{k=1}^{K} \sum_{p \in S_k} \| f_p - \theta_k \|_d

  Examples of ||·||_d: quadratic (K-means), absolute (K-medians), truncated (K-modes)

• Parametric methods with arbitrary likelihoods P(·|θ) — probabilistic K-means
  [Kearns, Mansour & Ng, UAI'97]:

    E(S, \theta) = - \sum_{k=1}^{K} \sum_{p \in S_k} \log P(f_p \mid \theta_k)

  Examples of P(·|θ): Gaussian, gamma, exponential, Gibbs, etc.
  e.g. for a Gaussian, P(f_p | θ_k) ~ exp(-||f_p - μ_k||^2 / 2σ^2), so the negative
  log-likelihood reduces to the (scaled) squared distance ||f_p - μ_k||^2.
  Could be juxtaposed with GMM/EM as hard clustering via ML parameter fitting.

• Non-parametric (pairwise) methods with any kernel or affinity measure k(x, y)
  (kernel K-means, average association, average distortion, normalized cut):
  replace dot-products by an arbitrary kernel k.
Probabilistic K-means Example: Elliptic K-means

For the Normal (Gaussian) distribution, the negative log-likelihood yields the (squared)
Mahalanobis distance

    \| f_p - \mu_k \|^2_{\Sigma_k} = (f_p - \mu_k)^T \Sigma_k^{-1} (f_p - \mu_k)

Examples:
a) Z – normal random vector with mean m and covariance Σ
b) X = AZ + m for an arbitrary vector m and matrix A: the distribution of X = AZ + m is also normal
[figures: comparison of Basic K-means vs Elliptic K-means results, using the (squared)
Mahalanobis distance for the Normal (Gaussian) distribution]
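A hedged sketch of the elliptic K-means assignment step, scoring each point by squared Mahalanobis distance under a per-cluster mean and covariance; the names and regularization are illustrative, and the full Gaussian negative log-likelihood would also include a log-determinant covariance term omitted here.

```python
import numpy as np

def mahalanobis_sq(F, mu, cov, reg=1e-6):
    """Squared Mahalanobis distance of each row of F to mean mu under covariance cov."""
    prec = np.linalg.inv(cov + reg * np.eye(cov.shape[0]))   # regularized inverse covariance
    d = F - mu
    return np.einsum('ni,ij,nj->n', d, prec, d)

def elliptic_assign(F, mus, covs):
    """Assignment step of elliptic K-means: nearest cluster in Mahalanobis distance."""
    D = np.stack([mahalanobis_sq(F, mu, cov) for mu, cov in zip(mus, covs)], axis=1)
    return D.argmin(1)
```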
Probabilistic K-means Example: Entropy Clustering

Start from the probabilistic K-means objective

    E(S, \theta) = - \sum_{k=1}^{K} \sum_{p \in S_k} \log P(f_p \mid \theta_k)

The inner sum is a Monte-Carlo estimation formula for the cross entropy between the empirical
distribution of cluster S_k and the model P(·|θ_k). Using the "optimal" distributions θ_k that
minimize this cross entropy, we get the entropy clustering criterion

    E(S) = \sum_{k=1}^{K} |S_k| \cdot H(S_k)

where H(S_k) is the entropy of the data in cluster S_k. This requires a sufficiently
descriptive (complex) class of probability models that can fit the data well.
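An illustrative sketch of this criterion for discrete (binned) features, where each cluster's entropy is computed from its empirical histogram; all names and the binning are made up for the example.

```python
import numpy as np

def entropy_clustering_energy(bins_per_point, assignment, K, n_bins):
    """Entropy clustering criterion  sum_k |S_k| * H(S_k)  for discrete features,
    where H(S_k) is the entropy of the feature histogram inside cluster k
    (bins_per_point holds non-negative integer bin indices)."""
    E = 0.0
    for k in range(K):
        vals = bins_per_point[assignment == k]
        if len(vals) == 0:
            continue
        hist = np.bincount(vals, minlength=n_bins) / len(vals)   # empirical distribution in S_k
        H = -np.sum(hist[hist > 0] * np.log(hist[hist > 0]))     # Shannon entropy of S_k
        E += len(vals) * H
    return E
```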
Probabilistic K-means: summary
- model fitting (to data): log-likelihood (model) parameter estimation
- complex data requires complex models

Basic K-means works only for compact clusters (blobs) that are linearly separable.

From complex models towards complex embeddings…
From basic K-means to kernel K-means (high-dimensional embedding story)

Example: data can become linearly separable after some non-linear embedding
φ : R^N → H, typically into a high-dimensional space H, for some (non-linear)
embedding function φ.

(explicit) K-means procedure (update at time t+1): apply basic K-means to the embedded
points φ(f_p). Equivalent formulation using the dim(H) x |Ω| embedding matrix Φ = [φ(f_p)]:

    m_k^t = \frac{\Phi\, s_k^t}{|S_k^t|}

where S_k^t is cluster k at iteration t and s_k^t is its |Ω|-dimensional indicator vector.
From basic K-means to kernel K-means (high-dimensional embedding story)

Assume for now that such an embedding φ is given.

Equivalent formulation of the (explicit) K-means procedure (update at time t+1): expanding
the distances ||φ(f_p) - m_k^t||^2 leaves only dot products ⟨φ(f_p), φ(f_q)⟩, i.e. entries of
the Gram matrix K = Φ^T Φ.
From basic K-means to kernel K-means (high-dimensional embedding story)

Assume for now that such an embedding φ is given.

(implicit) kernel K-means procedure (update at time t+1): re-assign each point p to the
cluster k with the closest (implicit) mean in H, using only Gram matrix entries:

    \| \varphi(f_p) - m_k^t \|^2 = K_{pp} - \frac{2}{|S_k^t|}\sum_{q \in S_k^t} K_{pq}
                                    + \frac{1}{|S_k^t|^2}\sum_{q,r \in S_k^t} K_{qr}

Requires only the kernel (Gram) matrix K — no need to know the explicit embedding Φ.
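A small numpy sketch of this implicit re-assignment step, working purely from a kernel matrix; the names are illustrative.

```python
import numpy as np

def kernel_kmeans_step(Kmat, labels, K):
    """One (implicit) kernel K-means re-assignment using only the kernel matrix.
    Distance of point p to the mean of cluster S_k in the embedding space:
        K_pp - 2 * mean_{q in S_k} K_pq + mean_{q,r in S_k} K_qr."""
    n = Kmat.shape[0]
    D = np.full((n, K), np.inf)
    for k in range(K):
        idx = np.flatnonzero(labels == k)
        if len(idx) == 0:
            continue
        cross = Kmat[:, idx].mean(1)                 # (1/|S_k|) sum_q K_pq
        within = Kmat[np.ix_(idx, idx)].mean()       # (1/|S_k|^2) sum_{q,r} K_qr
        D[:, k] = np.diag(Kmat) - 2 * cross + within
    return D.argmin(1)
```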
From basic K-means to kernel K-means (high-dimensional embedding story)

Kernel trick: start with (any?) kernel K.
If we start from given pairwise affinities (kernel matrix K), it may still be useful to think
about the embedding implicitly defined by the kernel (via decomposition K = Φ^T Φ).
(Mercer theorem: any p.s.d. kernel can be decomposed that way.)

Q: why even worry about embedding Φ when using the kernel K-means procedure?
A: (HINT) think about convergence — what do we minimize via the kernel K-means procedure?

Kernel trick: p.s.d. kernels K are a standard way to (implicitly) define some
high-dimensional embedding Φ (corresponding to decomposition K = Φ^T Φ).
Q: what is the dimension of each φ_p?
Example: Gaussian kernel  k(x, y) = exp(-||x - y||^2 / 2σ^2)

Kernel-induced embedding: the kernel defines an inner product in the original feature space,
⟨φ(x), φ(y)⟩ = k(x, y), with the corresponding kernel-induced metric

    d_k^2(x, y) = \| \varphi(x) - \varphi(y) \|^2 = k(x, x) + k(y, y) - 2\,k(x, y)

i.e. the kernel-defined Euclidean embedding is isometric to the original features equipped
with the kernel-induced metric.

NOTE: the kernel-induced embedding gives non-linear separation of the original features —
the high-dimensional isometric embedding induced by kernel K can make clusters linearly
separable (original feature space with kernel-induced metric ↔ kernel-induced Euclidean
embedding).

Intuition for such "magic" behind commonly used kernels (e.g. Gaussian)?
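A short sketch of the Gaussian kernel matrix and of the kernel-induced metric above; the helper names are illustrative.

```python
import numpy as np

def gaussian_kernel(F, sigma):
    """Gaussian (RBF) kernel matrix  K_pq = exp(-||f_p - f_q||^2 / (2 sigma^2))."""
    D2 = ((F[:, None, :] - F[None, :, :])**2).sum(-1)
    return np.exp(-D2 / (2 * sigma**2))

def kernel_induced_sq_metric(Kmat):
    """Kernel-induced squared metric  d_k^2(p, q) = K_pp + K_qq - 2 K_pq,
    i.e. squared Euclidean distance between the implicit embeddings of p and q."""
    diag = np.diag(Kmat)
    return diag[:, None] + diag[None, :] - 2 * Kmat
```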
From basic K-means to kernel K-means (robust metric story)

Kernel K-means objective (remember): pairwise K-means with distances replaced by the
kernel-induced metric d_k^2(p, q) = k(f_p, f_p) + k(f_q, f_q) - 2 k(f_p, f_q).

A robust metric focuses on local distortion (deemphasizes larger distances).

Examples:
• basic (linear) kernel k(x, y) = ⟨x, y⟩ → squared Euclidean distance ||f_p - f_q||^2,
  the distance in standard K-means
• Gaussian kernel → d_k^2(p, q) = 2 - 2 exp(-||f_p - f_q||^2 / 2σ^2),
  the distance in Gaussian kernel K-means: it grows like ||f_p - f_q||^2 / σ^2 for nearby
  points and saturates for distant points (robust)

[figure: kernel-induced distance vs. ||f_p - f_q||^2, saturating at large distances;
 example clusters S1, S2 separated for a suitable bandwidth σ]
On importance of positive semi-definite (p.s.d.) kernels K

- Given any (e.g. non-p.s.d.) kernel, a "diagonal shift" K̃ = K + δI allows one to formulate
  an equivalent kernel clustering objective with a p.s.d. kernel (for a sufficiently large
  scalar δ). It is easy to verify the equivalence of the kernel K-means objectives for any
  scalar δ, while the kernel K-means procedure itself is modified by the shift above.

- (Mercer theorem) p.s.d. guarantees the existence of an explicit Euclidean embedding Φ such
  that K = Φ^T Φ, that is K_pq = ⟨φ_p, φ_q⟩.
  This allows one to prove that the implicit kernel K-means procedure converges, due to its
  equivalence to the convergent explicit K-means procedure for some embedding Φ.
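One possible sketch of such a diagonal shift, choosing δ from the smallest eigenvalue so the shifted matrix becomes p.s.d.; the construction and the margin value are illustrative.

```python
import numpy as np

def diagonal_shift_psd(A, margin=1e-9):
    """Diagonal shift A + delta*I with delta chosen so the result is p.s.d.
    The clustering objective only changes by a constant, so the optimal partition
    is unchanged, while the matrix becomes a valid (Mercer) kernel."""
    A = 0.5 * (A + A.T)                        # symmetrize, just in case
    lam_min = np.linalg.eigvalsh(A).min()      # smallest eigenvalue
    delta = max(0.0, -lam_min) + margin
    return A + delta * np.eye(A.shape[0]), delta
```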
Weak kernel K-means versus kernel K-means

Due to isometry, each model in the original feature space corresponds to some point of the
embedding space H, so whenever the two objectives are equal they give the same solution S.
The opposite is not true:

    weak kernel K-means objective  ≥  kernel K-means objective

because the implicit search space for the cluster means (the higher-dimensional embedding
space H) is larger than the search space for models in the original feature space.
Weighted K-means and Weighted kernel K-means

• (unary) distortion between a point and a model
• (pairwise) distortion between two points
• unary and pairwise distortion clustering (general weighted case), e.g. K = 2

[diagram: map of clustering objectives]
- pKM — probabilistic K-means (ML model fitting; complex models): GMM fitting, elliptic
  K-means, entropy clustering, gamma fitting, Gibbs fitting, K-modes (mean-shift), …
- kKM — kernel K-means (pairwise, non-parametric clustering; complex embeddings):
  Gaussian kernel K-means, average association, average distortion, average cut,
  spectral ratio cuts, normalized cuts, weak kernel clustering with (unary) Hilbertian
  distortion and p.d. kernel distance, …
- basic K-means sits at the intersection of the two families.
Kernel Clustering
• kernel K-means, average association, Normalized Cuts, …
• density biases: isolation of modes or sparse subsets
• bound optimization
More on kernel K-means: non-parametric (kernel) clustering
Affinities between pairs of points:  A_pq = k(f_p, f_q)

Objective (below): explicit features f_p are unnecessary — only the affinity (or kernel)
matrix A = [A_pq] over Ω, the set of all points (graph nodes), is needed.
If necessary, an "embedding" φ such that A_pq = ⟨φ_p, φ_q⟩ can be found for p.s.d. A
(via eigen decomposition), as suggested by the MERCER THEOREM.

[figure: points of Ω partitioned into clusters S1, S2, S3 on the affinity graph]
Kernel K-means or average association: non-parametric (kernel) clustering

Average association measures the "self-association" of each cluster S_k:

    \sum_{k=1}^{K} \frac{\sum_{p,q \in S_k} A_{pq}}{|S_k|}

Maximizing it is equivalent to minimizing the kernel K-means objective. In matrix notation,
with S_k the indicator vector of cluster k and ' denoting transpose:

    \sum_{k=1}^{K} \frac{S_k' A\, S_k}{S_k' S_k}

[figure: clusters S1, S2, S3 on the affinity graph]
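A small sketch computing this average-association value (the quantity to be maximized) for a given partition; names are illustrative.

```python
import numpy as np

def average_association(A, labels, K):
    """Average association  sum_k  S_k' A S_k / (S_k' S_k)  for a partition given by labels."""
    total = 0.0
    for k in range(K):
        s = (labels == k).astype(float)          # indicator vector of cluster k
        size = s.sum()
        if size > 0:
            total += s @ A @ s / size            # self-association normalized by |S_k|
    return total
```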
Kernel K-means or average association: non-parametric (kernel) clustering

e.g. for the Gaussian kernel, kernel K-means approaches basic K-means for large bandwidths.
Why?
Local "compactness" (small bandwidth) vs. global "compactness" (basic K-means)?
Basic K-means vs Kernel Clustering

Basic K-means:
• compact blobs in RGB space → color quantization
• compact blobs in RGBXY space → superpixels [Achanta et al., PAMI 2011]

Kernel Clustering:
• segments in RGBXY space (not blobs!) → segmentation [Shi & Malik 2000]
Kernel K-means or average association: non-parametric (kernel) clustering

e.g. for the Gaussian kernel with "small" bandwidth: density mode isolation
[Marin et al., PAMI 2019] on inhomogeneous data density — "tight" clusters
[Shi & Malik, PAMI 2000], empirically observed. Shown by reduction to the continuous Gini
criterion, analogous to the mode bias of the discrete Gini criterion on discrete-valued data
[Breiman, Machine Learning 1996].

[figure: RGB features (no XY!) — small-bandwidth clustering isolates the density mode]
Kernel K-means or average association: properties for kernel bandwidths σ

[diagram: bandwidth axis from σ → 0 up to σ ≈ data range diameter]
• small σ: Breiman's bias (isolation of density modes) [Marin et al., PAMI 2019]
• large σ (data range diameter): reduces to basic K-means — "linear separation" and
  "equi-cardinality" bias [Ng et al., UAI'96]
• there may be no good (unbiased) solution in between
Kernel K-means or average association: non-parametric (kernel) clustering

A solution: density equalization [Marin et al., PAMI 2019]
Theorem ("Density Law", basic form): adaptive bandwidths σ_p for data in R^N implicitly
transform the data density. No fixed bandwidth will generate this result.

Simple density equalization example: average association with adaptive bandwidths σ_p as
above. NOTE: the same as a heuristic by Zelnik & Perona [NIPS 2004] for another clustering
objective.

Density Law: uses adaptive bandwidths and the standard KNN density estimate — based on the
volume of the ball containing the K nearest neighbors (the KNN ball radius R_p^K).

[figure: clusters S1, S2, S3 obtained after density equalization]
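A possible sketch of such locally adaptive bandwidths, in the spirit of the Zelnik-Perona self-tuning affinities with σ_p taken from the KNN ball radius; the neighbor count and the exact bandwidth choice are illustrative.

```python
import numpy as np

def knn_radius(F, K=7):
    """Radius of the ball containing the K nearest neighbors of each point
    (a standard ingredient of KNN density estimates)."""
    D = np.sqrt(((F[:, None, :] - F[None, :, :])**2).sum(-1))
    return np.sort(D, axis=1)[:, K]              # distance to the K-th neighbor (excluding self)

def adaptive_gaussian_affinity(F, K=7):
    """Locally adaptive affinities  A_pq = exp(-||f_p - f_q||^2 / (sigma_p * sigma_q))
    with per-point bandwidths sigma_p set from the KNN radius (illustrative choice)."""
    sig = knn_radius(F, K)
    D2 = ((F[:, None, :] - F[None, :, :])**2).sum(-1)
    return np.exp(-D2 / (sig[:, None] * sig[None, :] + 1e-12))
```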
Other kernel (graph) clustering objectives

So far we only looked at the "self-association" of S_k:  assoc(S_k, S_k) = Σ_{p,q∈S_k} A_pq.
Another basic quantity is the "cut" of S_k:  cut(S_k) = Σ_{p∈S_k, q∉S_k} A_pq.

Average Cut:

    \sum_{k=1}^{K} \frac{cut(S_k)}{|S_k|}

• related: Ratio Cut, Cheeger cut, isoperimetric number, conductance
• spectral graph theory, electrical flows, random walks

[figure: clusters S1, S2, S3 and the affinity edges cut between them]
Other kernel (graph) clustering objectives

Normalizing the Average Cut by the total "node degree" of each cluster, d_p = Σ_q A_pq,
instead of its cardinality gives the Normalized Cut:

    \sum_{k=1}^{K} \frac{cut(S_k)}{\sum_{p \in S_k} d_p}

Minimizing the Normalized Cut is equivalent to maximizing the Normalized Average Association

    \sum_{k=1}^{K} \frac{assoc(S_k, S_k)}{\sum_{p \in S_k} d_p}
Summary of common kernel clustering objectives

• Average Association (as discussed earlier) — bias to density modes
• Average Cut — bias to sparse subsets
• Normalized Cut = Normalized Average Association (degree normalization)
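A short sketch computing the Normalized Cut of a given partition from the affinity matrix, using cut(S_k) = assoc(S_k, V) - assoc(S_k, S_k); names are illustrative.

```python
import numpy as np

def normalized_cut(A, labels, K):
    """Normalized Cut  sum_k cut(S_k) / assoc(S_k, V)  for a partition given by labels."""
    d = A.sum(1)                                     # node degrees
    nc = 0.0
    for k in range(K):
        s = (labels == k)
        assoc_kV = d[s].sum()                        # total degree of cluster k
        if assoc_kV > 0:
            assoc_kk = A[np.ix_(s, s)].sum()         # self-association of cluster k
            nc += (assoc_kV - assoc_kk) / assoc_kV   # cut(S_k) / assoc(S_k, V)
    return nc
```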
Normalized Cut (NC) [Shi & Malik, 2000]

• small bandwidth (2.47): lack of non-linear separability — NC cuts off isolated points
• large bandwidth (2.48): still has a bias to sparse subsets (the opposite of the density mode)
• no fixed bandwidth will generate the good result here

Normalized Cut (NC) [Zelnik & Perona, 2004] with density equalization
(via locally adaptive bandwidths, e.g. a locally adaptive kernel with bandwidth 2R_p^K)

Question: Average Association (kernel K-means) gives a similar result for such σ_p … why?
Average Cut, Normalized Cut, Average Association:
Equivalence (after density equalization)

After density equalization [Marin et al. 2019] these objectives become equivalent:

    Avr. Assoc. (kernel K-means)  =  Avr. Cut (Cheeger sets)  =  Norm. Cut  =c  Norm. Avr. Assoc.

(=c denotes equality up to a constant). For simplicity assume a "KNN kernel"
(a locally adaptive kernel with bandwidth 2R_p^K).
Optimization
• block-coordinate descent (Lloyd’s algorithm)
• spectral relaxation
• bound optimization
Spectral Relaxation (quick overview)

In the context of kernel K-means (average association): use normalized (L2 norm is 1) cluster
indicators z_k = S_k / ||S_k||, collected into an |Ω| x K matrix Z = [z_1, …, z_K]. The average
association then equals Σ_k z_k' A z_k = tr(Z' A Z), where z_k' A z_k is the k-th element on
the diagonal of the K x K matrix Z' A Z.

Original optimization problem (NP hard): maximize over integral indicators.
Relaxed problem (closed-form solution): integrality of the indicators S_k is relaxed, giving
optimization over a unit sphere — (one of the generalizations of) the Rayleigh quotient
problem.

Closed-form solution: Z_1, Z_2, …, Z_K are (unit) eigenvectors of matrix A corresponding to
its K largest eigenvalues.
Intuition: consider a vector x expanded in the eigenbasis of A — the Rayleigh quotient
x'Ax / x'x is maximized by the top eigenvector, the next orthogonal component by the second
eigenvector, and so on…
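A short sketch of this relaxed solution: take the top-K eigenvectors of the affinity matrix and (optionally) discretize them with K-means; lloyd_kmeans refers to the earlier sketch.

```python
import numpy as np

def spectral_relaxation_embedding(A, K):
    """Relaxed solution of the average-association problem: the K (unit) eigenvectors of A
    with the largest eigenvalues. Rows can then be discretized, e.g. by running K-means."""
    w, V = np.linalg.eigh(0.5 * (A + A.T))     # eigen-decomposition of the symmetrized matrix
    return V[:, np.argsort(w)[::-1][:K]]       # |Omega| x K matrix of top-K eigenvectors

# usage sketch: Z = spectral_relaxation_embedding(A, K); labels, _ = lloyd_kmeans(Z, K)
```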
Kernel clustering via bound optimization

Lloyd's algorithm as bound optimization

The (explicit) K-means procedure (Lloyd's algorithm) corresponds to block-coordinate descent
for the K-means objective: at step t+1, update the partition S for fixed means, then update
the means for the fixed partition.

Remember the equivalent objective obtained by minimizing out the parameters m:

    E(S) = \min_m E(S, m)     where     E(S, m) = \sum_{k=1}^{K} \sum_{p \in S_k} \| f_p - m_k \|^2

and the optimal means for a fixed S are m_k^S = (1/|S_k|) Σ_{p∈S_k} f_p.

For fixed means m^{S^t}, the function E(S, m^{S^t}) is linear w.r.t. S (a unary function of
the binary indicators) and it is a bound for E(S):

    E(S, m^{S^t}) ≥ E(S)  for all S,        E(S^t, m^{S^t}) = E(S^t)

Taking S^{t+1} optimal for the bound E(S, m^{S^t}) over the binary indicators, and then
computing new means m^{S^{t+1}} to form the next bound E(S, m^{S^{t+1}}), gives a guaranteed
energy decrease:

    E(S^{t+1}) ≤ E(S^{t+1}, m^{S^t}) ≤ E(S^t, m^{S^t}) = E(S^t)

[figure: bounds E(S, m^{S^t}) and E(S, m^{S^{t+1}}) touching E(S) at S^t and S^{t+1}]
Bound optimization, in general

At each step t, construct an auxiliary function (bound) A_t(S) satisfying A_t(S) ≥ E(S) for
all S and A_t(S^t) = E(S^t); minimize it to obtain S^{t+1}, then build the next bound
A_{t+1}(S), and so on. Each step cannot increase E(S).
Kernel bound

Lemma 1 (concavity): the function e : R^|Ω| → R,  e(S_k) = - S_k' A S_k / (1' S_k)
(the negative self-association of cluster k divided by its cardinality), is concave over the
region S_k > 0 given a p.s.d. affinity matrix A := [A_pq].

A bound for the kernel clustering (KC) objective is therefore given by the first-order Taylor
expansion of e at the current indicators S_k^t — a linear (unary) function of S.

NOTE: optimizing this unary bound for KC (alone) is equivalent to iterative kernel K-means
à la Lloyd ['57]. (The intuition came from the observation that Lloyd's algorithm is unary
bound optimization.)

[figure: concave e(S_k) with linear bounds a_t(S) and a_{t+1}(S) at successive iterates]
(approximate) Spectral bound

Main idea:
- standard eigen analysis (PCA) of the kernel matrix gives a low-dimensional embedding, or
  (equivalently) a low-rank matrix Ã ≈ A minimizing the Frobenius error ||A - Ã||_F
- use the linear bound from Lemma 1 for this (approximate) kernel matrix Ã

NOTE: optimizing such a unary bound for KC (alone) is equivalent to iterative K-means
(Lloyd's algorithm) on the low-dimensional embedding, à la the discretization heuristic used
after spectral relaxation methods.

Empirical motivation for the low-dimensional spectral approximation:
NOTE: optimizing this unary bound alone for KC (without regularization) is similar to the
discretization heuristic (K-means) for spectral relaxation [Shi & Malik, 2000].

[plots: approximate KC energy (low dimensions) vs. exact KC energy, and approximate KC energy
(high dimensions) vs. exact KC energy, over the progression of the iterative (kernel) K-means
algorithm (Lloyd)]
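A possible sketch of this low-rank spectral embedding: keep the top eigenpairs of the (p.s.d.) kernel matrix so that A ≈ Φ'Φ, then run ordinary K-means on the embedding; the rank and names are illustrative, and lloyd_kmeans refers to the earlier sketch.

```python
import numpy as np

def low_rank_kernel_embedding(A, dim):
    """Rank-`dim` approximation of a p.s.d. kernel matrix A minimizing Frobenius error:
    A ~ Phi' Phi with a dim x |Omega| embedding Phi built from the top eigenpairs.
    Running K-means on the columns of Phi then optimizes the approximate KC objective."""
    w, V = np.linalg.eigh(0.5 * (A + A.T))
    order = np.argsort(w)[::-1][:dim]
    w_top = np.clip(w[order], 0.0, None)             # keep non-negative eigenvalues
    return (V[:, order] * np.sqrt(w_top)).T          # dim x |Omega| embedding matrix

# usage sketch:
# Phi = low_rank_kernel_embedding(A, dim=10)
# labels, _ = lloyd_kmeans(Phi.T, K)   # ordinary K-means on the low-dimensional embedding
```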