TRANSCRIPT
Topic 1
Clustering Basics
CS898
Overview
Basics (K-means)
• variance clustering
• generalizations (parametric & non-parametric)
Kernel K-means
Probabilistic K-means
• entropy clustering
Normalized Cut
Density biases
Spectral methods, bound optimization
In the beginning there was…
Basic K-means:

    E(S, m) = \sum_{k=1}^{K} \sum_{p \in S_k} \| f_p - m_k \|^2        (squared L2 norm)

input: features f_p
output: K subsets S_1, …, S_K of the data points (the clustering S)
extra parameters: the K means m_1, …, m_K
In this talk: K-means refers mostly to this or related objectives (not to the iterative Lloyd's algorithm, 1957).
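For concreteness, here is a minimal numpy sketch of this objective and of the block-coordinate (Lloyd-style) updates that locally minimize it; the function names and the initialization scheme are illustrative choices, not from the slides.

```python
import numpy as np

def kmeans_energy(F, labels, means):
    """Basic K-means objective: sum of squared L2 distances to the cluster means."""
    return sum(np.sum((F[labels == k] - means[k])**2) for k in range(len(means)))

def lloyd_kmeans(F, K, iters=100, seed=0):
    """Block-coordinate descent on the objective above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    means = F[rng.choice(len(F), K, replace=False)]            # init means from data points
    for _ in range(iters):
        d = ((F[:, None, :] - means[None, :, :])**2).sum(-1)   # |Omega| x K squared distances
        labels = d.argmin(1)                                    # assign each point to nearest mean
        means = np.stack([F[labels == k].mean(0) if np.any(labels == k) else means[k]
                          for k in range(K)])                   # re-estimate means
    return labels, means
```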
Basic K-means examples:
• RGB features → color quantization
• RGBXY features → superpixels
• XY features only → Voronoi cells
Compared to RGB only, adding XY gives spatial "compactness" (quasi-regularization).
Apply K-means to RGBXY features.
Basic K-means examples:
Superpixels
[SLIC superpixels, Achanta et al., PAMI 2011]
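A rough sketch of this idea (SLIC-style superpixels via K-means on RGBXY features); the spatial_weight parameter and the coordinate scaling are illustrative choices, and lloyd_kmeans refers to the sketch shown earlier.

```python
import numpy as np

def rgbxy_features(image, spatial_weight=0.5):
    """Stack RGB with scaled XY coordinates: clustering these 5-D features
    yields spatially compact, color-coherent segments (superpixels)."""
    H, W, _ = image.shape
    y, x = np.mgrid[0:H, 0:W]
    xy = np.stack([x, y], axis=-1).astype(float)
    xy *= spatial_weight * 255.0 / max(H, W)      # put XY on a scale comparable to RGB
    return np.concatenate([image.astype(float), xy], axis=-1).reshape(-1, 5)

# usage sketch (assumes lloyd_kmeans from the earlier snippet):
# F = rgbxy_features(img, spatial_weight=0.5)
# labels, _ = lloyd_kmeans(F, K=200)
# superpixels = labels.reshape(img.shape[:2])
```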
K-means as non-parametric clustering

Pairwise (non-parametric) form of the objective, with no parameters μ_k:

    E(S) = \sum_{k=1}^{K} \frac{\sum_{p,q \in S_k} \| f_p - f_q \|^2}{2\,|S_k|}

equivalent (easy to check) to the parametric form

    E(S) = \sum_{k=1}^{K} \sum_{p \in S_k} \| f_p - \mu_k \|^2     with     \mu_k = \frac{1}{|S_k|} \sum_{q \in S_k} f_q

These are the two standard formulas for sample variance — just plug in μ_k.
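A quick numerical sanity check of this equivalence on random data (illustrative, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.normal(size=(50, 3))                  # 50 random 3-D features
labels = rng.integers(0, 3, size=50)          # arbitrary assignment to K=3 clusters

mean_form, pair_form = 0.0, 0.0
for k in range(3):
    Fk = F[labels == k]
    mu = Fk.mean(0)
    mean_form += np.sum((Fk - mu)**2)                       # sum_p ||f_p - mu_k||^2
    D = ((Fk[:, None, :] - Fk[None, :, :])**2).sum(-1)      # pairwise squared distances in S_k
    pair_form += D.sum() / (2 * len(Fk))                    # sum_{p,q} ||f_p - f_q||^2 / (2|S_k|)

print(np.isclose(mean_form, pair_form))   # True: the two formulas agree
```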
K-means as variance clustering criteria

Both objectives above can be written as

    E_K(S) = \sum_{k=1}^{K} |S_k| \cdot \mathrm{var}(S_k)

=> K-means is good for "compact blobs"
K-means – common extensions

• Parametric methods with arbitrary distortion measure ||·||_d (distortion clustering):

    E(S, \theta) = \sum_{k=1}^{K} \sum_{p \in S_k} \| f_p - \theta_k \|_d

  Examples of ||·||_d: quadratic (K-means), absolute (K-medians), truncated (K-modes)

• Parametric methods with arbitrary likelihoods P(·|θ) — probabilistic K-means
  [Kearns, Mansour & Ng, UAI'97]:

    E(S, \theta) = - \sum_{k=1}^{K} \sum_{p \in S_k} \log P(f_p \mid \theta_k)

  Examples of P(·|θ): Gaussian, gamma, exponential, Gibbs, etc.
  e.g. for a Gaussian, P(f_p | θ_k) ~ exp(-||f_p - μ_k||^2 / 2σ^2), so the negative
  log-likelihood reduces to the (scaled) squared distance ||f_p - μ_k||^2.
  Could be juxtaposed with GMM/EM as hard clustering via ML parameter fitting.

• Non-parametric (pairwise) methods with any kernel or affinity measure k(x, y)
  (kernel K-means, average association, average distortion, normalized cut):
  replace dot-products by an arbitrary kernel k.
Probabilistic K-means Example: Elliptic K-means

For the Normal (Gaussian) distribution, the negative log-likelihood yields the (squared)
Mahalanobis distance

    \| f_p - \mu_k \|^2_{\Sigma_k} = (f_p - \mu_k)^T \Sigma_k^{-1} (f_p - \mu_k)

Examples:
a) Z – normal random vector with mean m and covariance Σ
b) X = AZ + m for an arbitrary vector m and matrix A: the distribution of X = AZ + m is also normal
[figures: comparison of Basic K-means vs Elliptic K-means results, using the (squared)
Mahalanobis distance for the Normal (Gaussian) distribution]
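A hedged sketch of the elliptic K-means assignment step, scoring each point by squared Mahalanobis distance under a per-cluster mean and covariance; the names and regularization are illustrative, and the full Gaussian negative log-likelihood would also include a log-determinant covariance term omitted here.

```python
import numpy as np

def mahalanobis_sq(F, mu, cov, reg=1e-6):
    """Squared Mahalanobis distance of each row of F to mean mu under covariance cov."""
    prec = np.linalg.inv(cov + reg * np.eye(cov.shape[0]))   # regularized inverse covariance
    d = F - mu
    return np.einsum('ni,ij,nj->n', d, prec, d)

def elliptic_assign(F, mus, covs):
    """Assignment step of elliptic K-means: nearest cluster in Mahalanobis distance."""
    D = np.stack([mahalanobis_sq(F, mu, cov) for mu, cov in zip(mus, covs)], axis=1)
    return D.argmin(1)
```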
Probabilistic K-means Example: Entropy Clustering

Start from the probabilistic K-means objective

    E(S, \theta) = - \sum_{k=1}^{K} \sum_{p \in S_k} \log P(f_p \mid \theta_k)

The inner sum is a Monte-Carlo estimation formula for the cross entropy between the empirical
distribution of cluster S_k and the model P(·|θ_k). Using the "optimal" distributions θ_k that
minimize this cross entropy, we get the entropy clustering criterion

    E(S) = \sum_{k=1}^{K} |S_k| \cdot H(S_k)

where H(S_k) is the entropy of the data in cluster S_k. This requires a sufficiently
descriptive (complex) class of probability models that can fit the data well.
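An illustrative sketch of this criterion for discrete (binned) features, where each cluster's entropy is computed from its empirical histogram; all names and the binning are made up for the example.

```python
import numpy as np

def entropy_clustering_energy(bins_per_point, assignment, K, n_bins):
    """Entropy clustering criterion  sum_k |S_k| * H(S_k)  for discrete features,
    where H(S_k) is the entropy of the feature histogram inside cluster k
    (bins_per_point holds non-negative integer bin indices)."""
    E = 0.0
    for k in range(K):
        vals = bins_per_point[assignment == k]
        if len(vals) == 0:
            continue
        hist = np.bincount(vals, minlength=n_bins) / len(vals)   # empirical distribution in S_k
        H = -np.sum(hist[hist > 0] * np.log(hist[hist > 0]))     # Shannon entropy of S_k
        E += len(vals) * H
    return E
```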
Probabilistic K-means: summary
- model fitting (to data): log-likelihood (model) parameter estimation
- complex data requires complex models

Basic K-means works only for compact clusters (blobs) that are linearly separable.

From complex models towards complex embeddings…
From basic K-means to kernel K-means (high-dimensional embedding story)

Example: data can become linearly separable after some non-linear embedding
φ : R^N → H, typically into a high-dimensional space H, for some (non-linear)
embedding function φ.

(explicit) K-means procedure (update at time t+1): apply basic K-means to the embedded
points φ(f_p). Equivalent formulation using the dim(H) x |Ω| embedding matrix Φ = [φ(f_p)]:

    m_k^t = \frac{\Phi\, s_k^t}{|S_k^t|}

where S_k^t is cluster k at iteration t and s_k^t is its |Ω|-dimensional indicator vector.
From basic K-means to kernel K-means (high-dimensional embedding story)

Assume for now that such an embedding φ is given.

Equivalent formulation of the (explicit) K-means procedure (update at time t+1): expanding
the distances ||φ(f_p) - m_k^t||^2 leaves only dot products ⟨φ(f_p), φ(f_q)⟩, i.e. entries of
the Gram matrix K = Φ^T Φ.
From basic K-means to kernel K-means (high-dimensional embedding story)

Assume for now that such an embedding φ is given.

(implicit) kernel K-means procedure (update at time t+1): re-assign each point p to the
cluster k with the closest (implicit) mean in H, using only Gram matrix entries:

    \| \varphi(f_p) - m_k^t \|^2 = K_{pp} - \frac{2}{|S_k^t|}\sum_{q \in S_k^t} K_{pq}
                                    + \frac{1}{|S_k^t|^2}\sum_{q,r \in S_k^t} K_{qr}

Requires only the kernel (Gram) matrix K — no need to know the explicit embedding Φ.
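A small numpy sketch of this implicit re-assignment step, working purely from a kernel matrix; the names are illustrative.

```python
import numpy as np

def kernel_kmeans_step(Kmat, labels, K):
    """One (implicit) kernel K-means re-assignment using only the kernel matrix.
    Distance of point p to the mean of cluster S_k in the embedding space:
        K_pp - 2 * mean_{q in S_k} K_pq + mean_{q,r in S_k} K_qr."""
    n = Kmat.shape[0]
    D = np.full((n, K), np.inf)
    for k in range(K):
        idx = np.flatnonzero(labels == k)
        if len(idx) == 0:
            continue
        cross = Kmat[:, idx].mean(1)                 # (1/|S_k|) sum_q K_pq
        within = Kmat[np.ix_(idx, idx)].mean()       # (1/|S_k|^2) sum_{q,r} K_qr
        D[:, k] = np.diag(Kmat) - 2 * cross + within
    return D.argmin(1)
```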
From basic K-means to kernel K-means (high-dimensional embedding story)

Kernel trick: start with (any?) kernel K.
If we start from given pairwise affinities (kernel matrix K), it may still be useful to think
about the embedding implicitly defined by the kernel (via decomposition K = Φ^T Φ).
(Mercer theorem: any p.s.d. kernel can be decomposed that way.)

Q: why even worry about embedding Φ when using the kernel K-means procedure?
A: (HINT) think about convergence — what do we minimize via the kernel K-means procedure?

Kernel trick: p.s.d. kernels K are a standard way to (implicitly) define some
high-dimensional embedding Φ (corresponding to decomposition K = Φ^T Φ).
Q: what is the dimension of each φ_p?
Example: Gaussian kernel  k(x, y) = exp(-||x - y||^2 / 2σ^2)

Kernel-induced embedding: the kernel defines an inner product in the original feature space,
⟨φ(x), φ(y)⟩ = k(x, y), with the corresponding kernel-induced metric

    d_k^2(x, y) = \| \varphi(x) - \varphi(y) \|^2 = k(x, x) + k(y, y) - 2\,k(x, y)

i.e. the kernel-defined Euclidean embedding is isometric to the original features equipped
with the kernel-induced metric.

NOTE: the kernel-induced embedding gives non-linear separation of the original features —
the high-dimensional isometric embedding induced by kernel K can make clusters linearly
separable (original feature space with kernel-induced metric ↔ kernel-induced Euclidean
embedding).

Intuition for such "magic" behind commonly used kernels (e.g. Gaussian)?
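A short sketch of the Gaussian kernel matrix and of the kernel-induced metric above; the helper names are illustrative.

```python
import numpy as np

def gaussian_kernel(F, sigma):
    """Gaussian (RBF) kernel matrix  K_pq = exp(-||f_p - f_q||^2 / (2 sigma^2))."""
    D2 = ((F[:, None, :] - F[None, :, :])**2).sum(-1)
    return np.exp(-D2 / (2 * sigma**2))

def kernel_induced_sq_metric(Kmat):
    """Kernel-induced squared metric  d_k^2(p, q) = K_pp + K_qq - 2 K_pq,
    i.e. squared Euclidean distance between the implicit embeddings of p and q."""
    diag = np.diag(Kmat)
    return diag[:, None] + diag[None, :] - 2 * Kmat
```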
From basic K-means to kernel K-means (robust metric story)

Kernel K-means objective (remember): pairwise K-means with distances replaced by the
kernel-induced metric d_k^2(p, q) = k(f_p, f_p) + k(f_q, f_q) - 2 k(f_p, f_q).

A robust metric focuses on local distortion (deemphasizes larger distances).

Examples:
• basic (linear) kernel k(x, y) = ⟨x, y⟩ → squared Euclidean distance ||f_p - f_q||^2,
  the distance in standard K-means
• Gaussian kernel → d_k^2(p, q) = 2 - 2 exp(-||f_p - f_q||^2 / 2σ^2),
  the distance in Gaussian kernel K-means: it grows like ||f_p - f_q||^2 / σ^2 for nearby
  points and saturates for distant points (robust)

[figure: kernel-induced distance vs. ||f_p - f_q||^2, saturating at large distances;
 example clusters S1, S2 separated for a suitable bandwidth σ]
On importance of positive semi-definite (p.s.d.) kernels K

- Given any (e.g. non-p.s.d.) kernel, a "diagonal shift" K̃ = K + δI allows one to formulate
  an equivalent kernel clustering objective with a p.s.d. kernel (for a sufficiently large
  scalar δ). It is easy to verify the equivalence of the kernel K-means objectives for any
  scalar δ, while the kernel K-means procedure itself is modified by the shift above.

- (Mercer theorem) p.s.d. guarantees the existence of an explicit Euclidean embedding Φ such
  that K = Φ^T Φ, that is K_pq = ⟨φ_p, φ_q⟩.
  This allows one to prove that the implicit kernel K-means procedure converges, due to its
  equivalence to the convergent explicit K-means procedure for some embedding Φ.
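One possible sketch of such a diagonal shift, choosing δ from the smallest eigenvalue so the shifted matrix becomes p.s.d.; the construction and the margin value are illustrative.

```python
import numpy as np

def diagonal_shift_psd(A, margin=1e-9):
    """Diagonal shift A + delta*I with delta chosen so the result is p.s.d.
    The clustering objective only changes by a constant, so the optimal partition
    is unchanged, while the matrix becomes a valid (Mercer) kernel."""
    A = 0.5 * (A + A.T)                        # symmetrize, just in case
    lam_min = np.linalg.eigvalsh(A).min()      # smallest eigenvalue
    delta = max(0.0, -lam_min) + margin
    return A + delta * np.eye(A.shape[0]), delta
```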
Weak kernel K-means versus kernel K-means

Due to isometry, each model in the original feature space corresponds to some point of the
embedding space H, so whenever the two objectives are equal they give the same solution S.
The opposite is not true:

    weak kernel K-means objective  ≥  kernel K-means objective

because the implicit search space for the cluster means (the higher-dimensional embedding
space H) is larger than the search space for models in the original feature space.
Weighted K-means and Weighted kernel K-means

• (unary) distortion between a point and a model
• (pairwise) distortion between two points
• unary and pairwise distortion clustering (general weighted case), e.g. K = 2

[diagram: map of clustering objectives]
- pKM — probabilistic K-means (ML model fitting; complex models): GMM fitting, elliptic
  K-means, entropy clustering, gamma fitting, Gibbs fitting, K-modes (mean-shift), …
- kKM — kernel K-means (pairwise, non-parametric clustering; complex embeddings):
  Gaussian kernel K-means, average association, average distortion, average cut,
  spectral ratio cuts, normalized cuts, weak kernel clustering with (unary) Hilbertian
  distortion and p.d. kernel distance, …
- basic K-means sits at the intersection of the two families.
Kernel Clustering
• kernel K-means, average association, Normalized Cuts, …
• density biases: isolation of modes or sparse subsets
• bound optimization
More on kernel K-means: non-parametric (kernel) clustering
Affinities between pairs of points:  A_pq = k(f_p, f_q)

Objective (below): explicit features f_p are unnecessary — only the affinity (or kernel)
matrix A = [A_pq] over Ω, the set of all points (graph nodes), is needed.
If necessary, an "embedding" φ such that A_pq = ⟨φ_p, φ_q⟩ can be found for p.s.d. A
(via eigen decomposition), as suggested by the MERCER THEOREM.

[figure: points of Ω partitioned into clusters S1, S2, S3 on the affinity graph]
Kernel K-means or average association: non-parametric (kernel) clustering

Average association measures the "self-association" of each cluster S_k:

    \sum_{k=1}^{K} \frac{\sum_{p,q \in S_k} A_{pq}}{|S_k|}

Maximizing it is equivalent to minimizing the kernel K-means objective. In matrix notation,
with S_k the indicator vector of cluster k and ' denoting transpose:

    \sum_{k=1}^{K} \frac{S_k' A\, S_k}{S_k' S_k}

[figure: clusters S1, S2, S3 on the affinity graph]
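A small sketch computing this average-association value (the quantity to be maximized) for a given partition; names are illustrative.

```python
import numpy as np

def average_association(A, labels, K):
    """Average association  sum_k  S_k' A S_k / (S_k' S_k)  for a partition given by labels."""
    total = 0.0
    for k in range(K):
        s = (labels == k).astype(float)          # indicator vector of cluster k
        size = s.sum()
        if size > 0:
            total += s @ A @ s / size            # self-association normalized by |S_k|
    return total
```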
Kernel K-means or average association: non-parametric (kernel) clustering

e.g. for the Gaussian kernel, kernel K-means approaches basic K-means for large bandwidths.
Why?
Local "compactness" (small bandwidth) vs. global "compactness" (basic K-means)?
Basic K-means vs Kernel Clustering

Basic K-means:
• compact blobs in RGB space → color quantization
• compact blobs in RGBXY space → superpixels [Achanta et al., PAMI 2011]

Kernel Clustering:
• segments in RGBXY space (not blobs!) → segmentation [Shi & Malik 2000]
Kernel K-means or average association: non-parametric (kernel) clustering

e.g. for the Gaussian kernel with "small" bandwidth: density mode isolation
[Marin et al., PAMI 2019] on inhomogeneous data density — "tight" clusters
[Shi & Malik, PAMI 2000], empirically observed. Shown by reduction to the continuous Gini
criterion, analogous to the mode bias of the discrete Gini criterion on discrete-valued data
[Breiman, Machine Learning 1996].

[figure: RGB features (no XY!) — small-bandwidth clustering isolates the density mode]
Kernel K-means or average association: properties for kernel bandwidths σ

[diagram: bandwidth axis from σ → 0 up to σ ≈ data range diameter]
• small σ: Breiman's bias (isolation of density modes) [Marin et al., PAMI 2019]
• large σ (data range diameter): reduces to basic K-means — "linear separation" and
  "equi-cardinality" bias [Ng et al., UAI'96]
• there may be no good (unbiased) solution in between
Kernel K-means or average association: non-parametric (kernel) clustering

A solution: density equalization [Marin et al., PAMI 2019]
Theorem ("Density Law", basic form): adaptive bandwidths σ_p for data in R^N implicitly
transform the data density. No fixed bandwidth will generate this result.

Simple density equalization example: average association with adaptive bandwidths σ_p as
above. NOTE: the same as a heuristic by Zelnik & Perona [NIPS 2004] for another clustering
objective.

Density Law: uses adaptive bandwidths and the standard KNN density estimate — based on the
volume of the ball containing the K nearest neighbors (the KNN ball radius R_p^K).

[figure: clusters S1, S2, S3 obtained after density equalization]
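A possible sketch of such locally adaptive bandwidths, in the spirit of the Zelnik-Perona self-tuning affinities with σ_p taken from the KNN ball radius; the neighbor count and the exact bandwidth choice are illustrative.

```python
import numpy as np

def knn_radius(F, K=7):
    """Radius of the ball containing the K nearest neighbors of each point
    (a standard ingredient of KNN density estimates)."""
    D = np.sqrt(((F[:, None, :] - F[None, :, :])**2).sum(-1))
    return np.sort(D, axis=1)[:, K]              # distance to the K-th neighbor (excluding self)

def adaptive_gaussian_affinity(F, K=7):
    """Locally adaptive affinities  A_pq = exp(-||f_p - f_q||^2 / (sigma_p * sigma_q))
    with per-point bandwidths sigma_p set from the KNN radius (illustrative choice)."""
    sig = knn_radius(F, K)
    D2 = ((F[:, None, :] - F[None, :, :])**2).sum(-1)
    return np.exp(-D2 / (sig[:, None] * sig[None, :] + 1e-12))
```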
Other kernel (graph) clustering objectives

So far we only looked at the "self-association" of S_k:  assoc(S_k, S_k) = Σ_{p,q∈S_k} A_pq.
Another basic quantity is the "cut" of S_k:  cut(S_k) = Σ_{p∈S_k, q∉S_k} A_pq.

Average Cut:

    \sum_{k=1}^{K} \frac{cut(S_k)}{|S_k|}

• related: Ratio Cut, Cheeger cut, isoperimetric number, conductance
• spectral graph theory, electrical flows, random walks

[figure: clusters S1, S2, S3 and the affinity edges cut between them]
Other kernel (graph) clustering objectives

Normalizing the Average Cut by the total "node degree" of each cluster, d_p = Σ_q A_pq,
instead of its cardinality gives the Normalized Cut:

    \sum_{k=1}^{K} \frac{cut(S_k)}{\sum_{p \in S_k} d_p}

Minimizing the Normalized Cut is equivalent to maximizing the Normalized Average Association

    \sum_{k=1}^{K} \frac{assoc(S_k, S_k)}{\sum_{p \in S_k} d_p}
Summary of common kernel clustering objectives

• Average Association (as discussed earlier) — bias to density modes
• Average Cut — bias to sparse subsets
• Normalized Cut = Normalized Average Association (degree normalization)
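A short sketch computing the Normalized Cut of a given partition from the affinity matrix, using cut(S_k) = assoc(S_k, V) - assoc(S_k, S_k); names are illustrative.

```python
import numpy as np

def normalized_cut(A, labels, K):
    """Normalized Cut  sum_k cut(S_k) / assoc(S_k, V)  for a partition given by labels."""
    d = A.sum(1)                                     # node degrees
    nc = 0.0
    for k in range(K):
        s = (labels == k)
        assoc_kV = d[s].sum()                        # total degree of cluster k
        if assoc_kV > 0:
            assoc_kk = A[np.ix_(s, s)].sum()         # self-association of cluster k
            nc += (assoc_kV - assoc_kk) / assoc_kV   # cut(S_k) / assoc(S_k, V)
    return nc
```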
Normalized Cut (NC) [Shi & Malik, 2000]

• small bandwidth (2.47): lack of non-linear separability — NC cuts off isolated points
• large bandwidth (2.48): still has a bias to sparse subsets (the opposite of the density mode)
• no fixed bandwidth will generate the good result here

Normalized Cut (NC) [Zelnik & Perona, 2004] with density equalization
(via locally adaptive bandwidths, e.g. a locally adaptive kernel with bandwidth 2R_p^K)

Question: Average Association (kernel K-means) gives a similar result for such σ_p … why?
Average Cut, Normalized Cut, Average Association:
Equivalence (after density equalization)

After density equalization [Marin et al. 2019] these objectives become equivalent:

    Avr. Assoc. (kernel K-means)  =  Avr. Cut (Cheeger sets)  =  Norm. Cut  =c  Norm. Avr. Assoc.

(=c denotes equality up to a constant). For simplicity assume a "KNN kernel"
(a locally adaptive kernel with bandwidth 2R_p^K).
Optimization
• block-coordinate descent (Lloyd’s algorithm)
• spectral relaxation
• bound optimization
Spectral Relaxation (quick overview)

In the context of kernel K-means (average association): use normalized (L2 norm is 1) cluster
indicators z_k = S_k / ||S_k||, collected into an |Ω| x K matrix Z = [z_1, …, z_K]. The average
association then equals Σ_k z_k' A z_k = tr(Z' A Z), where z_k' A z_k is the k-th element on
the diagonal of the K x K matrix Z' A Z.

Original optimization problem (NP hard): maximize over integral indicators.
Relaxed problem (closed-form solution): integrality of the indicators S_k is relaxed, giving
optimization over a unit sphere — (one of the generalizations of) the Rayleigh quotient
problem.

Closed-form solution: Z_1, Z_2, …, Z_K are (unit) eigenvectors of matrix A corresponding to
its K largest eigenvalues.
Intuition: consider a vector x expanded in the eigenbasis of A — the Rayleigh quotient
x'Ax / x'x is maximized by the top eigenvector, the next orthogonal component by the second
eigenvector, and so on…
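A short sketch of this relaxed solution: take the top-K eigenvectors of the affinity matrix and (optionally) discretize them with K-means; lloyd_kmeans refers to the earlier sketch.

```python
import numpy as np

def spectral_relaxation_embedding(A, K):
    """Relaxed solution of the average-association problem: the K (unit) eigenvectors of A
    with the largest eigenvalues. Rows can then be discretized, e.g. by running K-means."""
    w, V = np.linalg.eigh(0.5 * (A + A.T))     # eigen-decomposition of the symmetrized matrix
    return V[:, np.argsort(w)[::-1][:K]]       # |Omega| x K matrix of top-K eigenvectors

# usage sketch: Z = spectral_relaxation_embedding(A, K); labels, _ = lloyd_kmeans(Z, K)
```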
Kernel clustering via bound optimization

Lloyd's algorithm as bound optimization

The (explicit) K-means procedure (Lloyd's algorithm) corresponds to block-coordinate descent
for the K-means objective: at step t+1, update the partition S for fixed means, then update
the means for the fixed partition.

Remember the equivalent objective obtained by minimizing out the parameters m:

    E(S) = \min_m E(S, m)     where     E(S, m) = \sum_{k=1}^{K} \sum_{p \in S_k} \| f_p - m_k \|^2

and the optimal means for a fixed S are m_k^S = (1/|S_k|) Σ_{p∈S_k} f_p.

For fixed means m^{S^t}, the function E(S, m^{S^t}) is linear w.r.t. S (a unary function of
the binary indicators) and it is a bound for E(S):

    E(S, m^{S^t}) ≥ E(S)  for all S,        E(S^t, m^{S^t}) = E(S^t)

Taking S^{t+1} optimal for the bound E(S, m^{S^t}) over the binary indicators, and then
computing new means m^{S^{t+1}} to form the next bound E(S, m^{S^{t+1}}), gives a guaranteed
energy decrease:

    E(S^{t+1}) ≤ E(S^{t+1}, m^{S^t}) ≤ E(S^t, m^{S^t}) = E(S^t)

[figure: bounds E(S, m^{S^t}) and E(S, m^{S^{t+1}}) touching E(S) at S^t and S^{t+1}]
Bound optimization, in general

At each step t, construct an auxiliary function (bound) A_t(S) satisfying A_t(S) ≥ E(S) for
all S and A_t(S^t) = E(S^t); minimize it to obtain S^{t+1}, then build the next bound
A_{t+1}(S), and so on. Each step cannot increase E(S).
Kernel bound

Lemma 1 (concavity): the function e : R^|Ω| → R,  e(S_k) = - S_k' A S_k / (1' S_k)
(the negative self-association of cluster k divided by its cardinality), is concave over the
region S_k > 0 given a p.s.d. affinity matrix A := [A_pq].

A bound for the kernel clustering (KC) objective is therefore given by the first-order Taylor
expansion of e at the current indicators S_k^t — a linear (unary) function of S.

NOTE: optimizing this unary bound for KC (alone) is equivalent to iterative kernel K-means
à la Lloyd ['57]. (The intuition came from the observation that Lloyd's algorithm is unary
bound optimization.)

[figure: concave e(S_k) with linear bounds a_t(S) and a_{t+1}(S) at successive iterates]
(approximate) Spectral bound

Main idea:
- standard eigen analysis (PCA) of the kernel matrix gives a low-dimensional embedding, or
  (equivalently) a low-rank matrix Ã ≈ A minimizing the Frobenius error ||A - Ã||_F
- use the linear bound from Lemma 1 for this (approximate) kernel matrix Ã

NOTE: optimizing such a unary bound for KC (alone) is equivalent to iterative K-means
(Lloyd's algorithm) on the low-dimensional embedding, à la the discretization heuristic used
after spectral relaxation methods.

Empirical motivation for the low-dimensional spectral approximation:
NOTE: optimizing this unary bound alone for KC (without regularization) is similar to the
discretization heuristic (K-means) for spectral relaxation [Shi & Malik, 2000].

[plots: approximate KC energy (low dimensions) vs. exact KC energy, and approximate KC energy
(high dimensions) vs. exact KC energy, over the progression of the iterative (kernel) K-means
algorithm (Lloyd)]
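A possible sketch of this low-rank spectral embedding: keep the top eigenpairs of the (p.s.d.) kernel matrix so that A ≈ Φ'Φ, then run ordinary K-means on the embedding; the rank and names are illustrative, and lloyd_kmeans refers to the earlier sketch.

```python
import numpy as np

def low_rank_kernel_embedding(A, dim):
    """Rank-`dim` approximation of a p.s.d. kernel matrix A minimizing Frobenius error:
    A ~ Phi' Phi with a dim x |Omega| embedding Phi built from the top eigenpairs.
    Running K-means on the columns of Phi then optimizes the approximate KC objective."""
    w, V = np.linalg.eigh(0.5 * (A + A.T))
    order = np.argsort(w)[::-1][:dim]
    w_top = np.clip(w[order], 0.0, None)             # keep non-negative eigenvalues
    return (V[:, order] * np.sqrt(w_top)).T          # dim x |Omega| embedding matrix

# usage sketch:
# Phi = low_rank_kernel_embedding(A, dim=10)
# labels, _ = lloyd_kmeans(Phi.T, K)   # ordinary K-means on the low-dimensional embedding
```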