TRANSCRIPT
DB Seminar Series: Semi-supervised Projected Clustering
By: Kevin Yip (4th May 2004)
Outline
– Introduction
  • Projected clustering
  • Semi-supervised clustering
– Our problem
– Our new algorithm
– Experimental results
– Future works and extensions
Projected Clustering
Where are the clusters?
Projected Clustering
Pattern-based projected cluster:
[Line chart omitted: Series1-Series4 plotted across dimensions 1-5, illustrating a pattern-based projected cluster.]
Projected Clustering
Goal: to discover clusters and their relevant dimensions that optimize a certain objective function.
Previous approaches:
– Partitional: PROCLUS, ORCLUS
– One cluster at a time: DOC, FastDOC, MineClus
– Hierarchical: HARP
Projected Clustering
Limitations of these approaches:
– Cannot detect clusters of extremely low dimensionality (clusters with a low percentage of relevant dimensions, e.g. only 5% of the input dimensions are relevant)
– Require the input of parameter values that are hard for users to supply
– Performance is sensitive to the parameter values
– High time complexity
Semi-supervised Clustering
In some applications, a small amount of domain knowledge is usually available (e.g. the functions of 5% of the genes probed on a microarray).
This knowledge may not be suitable or sufficient for carrying out classification.
Clustering algorithms make little use of such external knowledge.
Semi-supervised Clustering
The idea of semi-supervised clustering:
– Use the models implicitly assumed behind a clustering algorithm (e.g. the compact hyperspheres of k-means, the density-connected irregular regions of DBScan)
– Use external knowledge to guide the tuning of the model parameters (e.g. the locations of cluster centers)
Semi-supervised Clustering
Why not clustering?
– The clusters produced may not be the ones required.
– There could be multiple possible groupings.
– There is no way to utilize the domain knowledge that is accessible (active learning vs. passive validation).
(Guha et al., 1998)
Semi-supervised Clustering
Why not classification?
– There is insufficient labeled data:
  • Objects are not labeled.
  • The amount of labeled objects is statistically insignificant.
  • The labeled objects do not cover all classes.
  • The labeled objects of a class do not cover all cases (e.g. they are all found at one side of the class).
– It is not always possible to find a classification method whose underlying model fits the data (e.g. pattern-based similarity).
Our Problem
Data model:
– The input dataset has n objects and d dimensions.
– The dataset contains k disjoint clusters, and possibly some outlier objects.
– Each cluster is associated with a set of relevant dimensions.
– If a dimension is relevant to a cluster, the projections of the cluster members on that dimension are random samples of a local Gaussian distribution.
– Other projections are random samples of a global distribution (e.g. a uniform distribution, or a Gaussian distribution with a standard deviation much larger than those of the local distributions).
Our Problem
Resulting data: if a dimension is relevant to a cluster, the projections of its members on that dimension will be close to each other (the within-cluster variance is much smaller than on irrelevant dimensions).
Example:
        X         Y         Z
C1   N(5, 1)   N(5, 1)   N(5, 1)
C2   N(8, 1)   N(6, 1)   U(0, 10)
C3   U(0, 10)  U(0, 10)  U(0, 10)
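For concreteness, a minimal NumPy sketch that draws data following this example table (the per-cluster size of 100 is a hypothetical choice, not stated on the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_cluster = 100  # hypothetical cluster size, not specified on the slides

# C1: X, Y, Z all relevant (local Gaussians)
c1 = rng.normal(loc=[5, 5, 5], scale=1, size=(n_per_cluster, 3))

# C2: X and Y relevant, Z irrelevant (drawn from the global uniform distribution)
c2 = np.column_stack([
    rng.normal(8, 1, n_per_cluster),
    rng.normal(6, 1, n_per_cluster),
    rng.uniform(0, 10, n_per_cluster),
])

# C3: no relevant dimensions, all projections from the global distribution
c3 = rng.uniform(0, 10, size=(n_per_cluster, 3))

data = np.vstack([c1, c2, c3])
labels = np.repeat([0, 1, 2], n_per_cluster)
```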
Our Problem
Problem definition:
– Inputs:
  • The dataset D
  • The target number of clusters k
  • A (possibly empty) set Io of labeled objects (object ID, class label), which may or may not cover all classes
  • A (possibly empty) set Iv of labeled relevant dimensions (dimension ID, class label), which may or may not cover all classes; a single dimension can be specified as relevant to multiple clusters
Our Problem
Problem definition (cont'd):
– Outputs:
  • A set of k disjoint projected clusters with a (locally) optimal objective score
  • A (possibly empty) set of outlier objects
Our Problem
Assumptions made in this study:
– There is a primary clustering target (cf. biclustering)
– Disjoint, axis-parallel clusters (cf. subspace clustering and ORCLUS)
– Distance-based similarity
– One cluster per class (cf. decision trees)
– All inputs are correct, but they can be biased (i.e. with projections on the relevant dimensions deviating from the cluster center)
Our New Algorithm
Basic idea: k-medoid/median
1. Determine the potential medoids (seeds) and the relevant dimensions of each cluster.
2. Assign every object to the cluster (or to the outlier list) that gives the greatest improvement to the objective score.
3. Decide which medoids are good/bad:
   – A good medoid: replace it by the cluster median and refine the selected dimensions.
   – A bad medoid: replace it by another seed.
4. Repeat steps 2 and 3 until no improvement is obtained within a certain number of iterations.
(A simplified sketch of this loop is given below.)
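A highly simplified, runnable sketch of the iterative loop above, assuming random seeds, dimension selection by a variance threshold m, and a plain projected-distance assignment; the objective-driven assignment, the outlier list and the seed groups of the actual algorithm are omitted:

```python
import numpy as np

def select_dimensions(members, global_var, m=0.5):
    """Select dimensions whose within-cluster variance is below m * global variance."""
    local_var = members.var(axis=0)
    return np.where(local_var < m * global_var)[0]

def projected_kmedoid(data, k, m=0.5, max_iter=20, seed=0):
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = data.shape
    global_var = data.var(axis=0)
    medoids = data[rng.choice(n, size=k, replace=False)]  # random seeds (no inputs)
    dims = [np.arange(d)] * k                             # start with all dimensions

    for _ in range(max_iter):
        # Step 2 (simplified): assign each object to the closest medoid,
        # measured only on that cluster's selected dimensions
        dists = np.stack([
            ((data[:, dims[i]] - medoids[i][dims[i]]) ** 2).mean(axis=1)
            for i in range(k)
        ])
        assign = dists.argmin(axis=0)

        # Step 3 (simplified): treat every medoid as "good", replace it by the
        # cluster median and reselect the relevant dimensions
        new_medoids = medoids.copy()
        for i in range(k):
            members = data[assign == i]
            if len(members) == 0:
                continue
            new_medoids[i] = np.median(members, axis=0)
            dims[i] = select_dimensions(members, global_var, m)
            if len(dims[i]) == 0:
                dims[i] = np.arange(d)
        if np.allclose(new_medoids, medoids):
            break
        medoids = new_medoids
    return assign, medoids, dims
```

For instance, `assign, medoids, dims = projected_kmedoid(data, k=3, m=0.5)` can be run on the synthetic example generated earlier.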
Our New Algorithm
Issues to consider:
– Design of the objective function
– Selection of relevant dimensions for a cluster
– Determination of seeds and the relevant dimensions of the corresponding potential clusters
– Replacement of medoids
Our New Algorithm
Design goals of the objective function:
– Should not have a trivial best score (e.g. when each cluster selects only one dimension)
– Should not be ruined by the selection of a small number of irrelevant dimensions
– Should be robust (the clustering accuracy should not degrade seriously when the input parameter values are not very accurate)
Our New Algorithm
The objective function:
– Overall score: $\phi = \frac{1}{nd}\sum_{i=1}^{k} n_i\,\phi_i$
– Score component of cluster $C_i$: $\phi_i = \sum_{v_j \in V_i} \phi_{ij}$ ($n_i$: number of members of $C_i$; $V_i$: set of selected dimensions of $C_i$)
– Contribution of selected dimension $v_j$ to the score component of $C_i$: $\phi_{ij} = \left(1 - \frac{1}{n_i}\right)\left(1 - \frac{\hat{\sigma}_{ij}^2}{\tilde{\sigma}_{ij}^2}\right)$, where $\hat{\sigma}_{ij}^2 = \frac{1}{n_i - 1}\sum_{x \in C_i}(x_j - \hat{\mu}_{ij})^2$ is the variance of the projections of $C_i$'s members on $v_j$
– $\tilde{\sigma}_{ij}^2$: normalization factor
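A small NumPy sketch of computing this score under Scheme 1 below (normalization factor m times the global variance); treat it as an illustration of the formula written above rather than the exact published definition:

```python
import numpy as np

def objective_score(data, assign, dims, m=0.5):
    """Sum over clusters of the size-weighted contributions of their
    selected dimensions, normalized by n * d (see the formula above)."""
    n, d = data.shape
    global_var = data.var(axis=0)
    sigma_tilde = m * global_var          # normalization factor (Scheme 1)
    total = 0.0
    for i, vi in enumerate(dims):         # vi: selected dimensions of cluster i
        members = data[assign == i]
        n_i = len(members)
        if n_i < 2 or len(vi) == 0:
            continue
        local_var = members[:, vi].var(axis=0, ddof=1)
        phi_ij = (1 - 1 / n_i) * (1 - local_var / sigma_tilde[vi])
        total += n_i * phi_ij.sum()
    return total / (n * d)
```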
Our New Algorithm
Characteristics of the objective function:
– A higher score means a better clustering.
– There is no trivial best score when each cluster selects only one dimension or selects all dimensions.
– Relevant dimensions (dimensions with a smaller $\hat{\sigma}_{ij}^2$) contribute more to the objective score.
– Robust? (To be discussed soon…)
Our New Algorithm
Dimension selection:
– In order to maximize $\phi_i$, all dimensions with $\hat{\sigma}_{ij}^2 < \tilde{\sigma}_{ij}^2$ should be selected.
– Appropriate values of $\tilde{\sigma}_{ij}^2$:
  • Should not exceed $\hat{\sigma}_j^2$, the global variance of dimension $v_j$ (otherwise even irrelevant dimensions would tend to be selected)
  • Scheme 1: $\tilde{\sigma}_{ij}^2 = m\,\hat{\sigma}_j^2$, where $0 < m \le 1$
  • Scheme 1b: $\tilde{\sigma}_{ij}^2 = \hat{\sigma}_j^2$, but only dimensions with $\hat{\sigma}_{ij}^2 < m\,\hat{\sigma}_j^2$ are selected => easier to compare the results obtained with different m
Our New Algorithm
Scheme 2: estimate the probability for an irrelevant dimension to be selected (the global distribution needs to be known). If the global distribution is Gaussian…
– If $n_i$ values are randomly sampled from the global distribution of an irrelevant dimension $v_j$, the random variable $(n_i - 1)\hat{\sigma}_{ij}^2 / \sigma_j^2$ has a chi-square distribution with $n_i - 1$ degrees of freedom.
– Suppose we want the probability of selecting an irrelevant dimension to be $p$, i.e. $p = \Pr(\hat{\sigma}_{ij}^2 \le \tilde{\sigma}_{ij}^2)$. From the cumulative chi-square distribution, the corresponding $\tilde{\sigma}_{ij}^2$ can be computed.
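With SciPy, the corresponding threshold can be obtained from the chi-square quantile function; a sketch (the function name and the choice p = 0.1 are mine):

```python
from scipy.stats import chi2

def variance_threshold(n_i, global_var, p=0.1):
    """Return sigma_tilde^2 such that an irrelevant (globally Gaussian) dimension
    is selected with probability p, i.e.
    Pr((n_i - 1) * local_var / global_var <= chi2.ppf(p, n_i - 1)) = p."""
    return chi2.ppf(p, df=n_i - 1) * global_var / (n_i - 1)

# Roughly matching the slides' n_i = 30 case:
print(variance_threshold(30, 1.0, p=0.1))   # about 0.68 (the slides read ~0.66 off the plot)
```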
Our New Algorithm
Probability density function and cumulative distribution of $(n_i - 1)\hat{\sigma}_{ij}^2 / \sigma_j^2$ ($n_i = 30$):
[Plots omitted. Reading off the cumulative distribution: $(30 - 1)\hat{\sigma}_{ij}^2 / \sigma_j^2 \le 19 \Rightarrow \hat{\sigma}_{ij}^2 \le 0.66\,\sigma_j^2$ (i.e. $\tilde{\sigma}_{ij}^2 = m\,\sigma_j^2$ with $m \approx 0.66$).]
Our New Algorithm
Robustness of the algorithm:
– A good value of m should be…
  • Large enough to tolerate local variances
  • Small enough to distinguish local variances from global variances
– The best value to use is data-dependent, but provided the difference between local and global variances is large, there is usually a wide range of values that lead to results with acceptable performance (e.g. 0.3 < m < 0.7).
Our New Algorithm
Determination of seeds and the relevant dimensions of the corresponding potential clusters:
– Traditional approach:
  • One seed pool
  • Seeds determined randomly / by the max-min distance method / by preclustering (e.g. hierarchical)
  • Relevant dimensions of each cluster determined by a set of objects near the medoid (in the input space)
Our New Algorithm
Our proposal – seed groups:
– Seeds are stored in separate seed groups; each seed group contains a small number (e.g. 5) of seeds.
– There is one private seed group for each cluster with some inputs.
– A number of public seed groups are shared by all clusters without external inputs.
– The seeds of the cluster with the largest amount of inputs are initialized first (as we are most confident in their correctness), then those of clusters with fewer inputs, and so on. Finally, the public seed groups are initialized.
Our New Algorithm
Our proposal – seed selection:
– Based on low-dimensional histograms (grids)
– Relevant dimension => small variance => high density
– Procedure:
  • Determine a starting point
  • Hill-climbing
– => Need to determine both the dimensions used in constructing the grid and the starting point (a sketch of the hill-climbing is given below)
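An illustrative sketch of the histogram hill-climbing; the grid resolution and the neighbourhood definition are my own choices, not specified on the slides:

```python
import numpy as np
from itertools import product

def hill_climb_seed(data, grid_dims, start, bins=10):
    """Hill-climb on a histogram built over `grid_dims` and return a dense cell.
    `start` is an object (a row of `data`) used as the starting point."""
    sub = data[:, grid_dims]
    hist, edges = np.histogramdd(sub, bins=bins)
    # locate the starting cell
    cell = tuple(
        min(np.searchsorted(edges[a], start[dim], side="right") - 1, bins - 1)
        for a, dim in enumerate(grid_dims)
    )
    while True:
        # examine all neighbouring cells (including diagonals and the cell itself)
        neighbours = [
            tuple(np.clip(np.array(cell) + np.array(step), 0, bins - 1))
            for step in product((-1, 0, 1), repeat=len(grid_dims))
        ]
        best = max(neighbours, key=lambda c: hist[c])
        if hist[best] <= hist[cell]:
            return cell, hist[cell]          # local density maximum reached
        cell = best
```

The objects falling into the returned cell would then form the seed cluster from which a medoid candidate is taken.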
Our New Algorithm
Determining the grid-constructing dimensions and the starting point:
– Case 1: a cluster with both labeled objects and labeled relevant dimensions
– Case 2: a cluster with only labeled objects
– Case 3: a cluster with only labeled relevant dimensions
– Case 4: a cluster with no inputs
Our New Algorithm
Case 1: both kinds of inputs are available
1. Form a seed cluster from the input objects.
2. Rank all dimensions by $\phi_{ij} = \left(1 - \frac{1}{n_i}\right)\left(1 - \frac{\hat{\sigma}_{ij}^2}{\tilde{\sigma}_{ij}^2}\right)$.
3. All dimensions with a positive $\phi_{ij}$, or in the input set Iv, are candidate dimensions for constructing the histograms.
4. Relative chance of being selected: $\phi_{ij}$ if dimension $v_j$ is not in Iv, 1 otherwise.
5. The starting point is the median of the seed cluster.
(A sketch of the weighted dimension sampling in steps 3 and 4 is given below.)
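A sketch of steps 3 and 4, sampling, say, two grid-constructing dimensions with the stated weights; the helper name and the choice of two dimensions are mine:

```python
import numpy as np

def sample_grid_dims(phi, labeled_dims, n_grid_dims=2, seed=0):
    """phi: array of phi_ij values for every dimension of the seed cluster.
    labeled_dims: dimension indices given in Iv for this cluster."""
    rng = np.random.default_rng(seed)
    # weight 1 for dimensions in Iv, phi_ij otherwise
    weights = np.where(np.isin(np.arange(len(phi)), labeled_dims), 1.0, phi)
    candidates = np.where(weights > 0)[0]        # positive phi_ij or in Iv
    w = weights[candidates]
    n_pick = min(n_grid_dims, len(candidates))
    return rng.choice(candidates, size=n_pick, replace=False, p=w / w.sum())
```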
Our New Algorithm
Example: cluster 2
– $\phi_{2,x}$ = 0.68
– $\phi_{2,y}$ = 0.83
– $\phi_{2,z}$ = -0.02
The hill-climbing mechanism fixes errors due to biased inputs.
Our New Algorithm
Case 2: labeled objects only
– Similar to case 1, but the chance for each dimension to be selected is based on $\phi_{ij}$ only.
Case 3: labeled dimensions only
– Similar to case 1, but with no starting point, i.e. all cells are examined and the one with the highest density is returned.
Our New Algorithm
Case 4: no inputs
– The tentative seed is the object with the maximum projected distance to the closest already-selected seeds (a modified max-min distance method).
– For each dimension, a one-dimensional histogram is constructed to determine the density of objects around the projection of the tentative seed.
– The chance of each dimension being selected to construct the grid is based on this density.
– The tentative seed is used as the starting point.
(A sketch of the max-min step is given below.)
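A sketch of the max-min step; here a plain Euclidean distance over all dimensions is used, whereas the slides call for a projected distance on the selected dimensions:

```python
import numpy as np

def next_tentative_seed(data, chosen_idx):
    """Return the index of the object farthest from its closest chosen seed."""
    chosen = data[chosen_idx]                                        # (c, d)
    # squared distance of every object to every already-chosen seed
    d2 = ((data[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2)  # (n, c)
    nearest = d2.min(axis=1)
    nearest[chosen_idx] = -np.inf        # never re-pick an existing seed
    return int(nearest.argmax())
```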
Our New Algorithm
Medoid drawing/replacement:
– The medoid of each cluster is initially drawn from…
  • The corresponding private seed group, if available
  • A unique public seed group, otherwise
– After assignment, the medoid of cluster $C_i$ is likely to be a bad one if…
  • $\phi_i / \max_{i'} \phi_{i'}$ is small – the cluster has a low quality compared to the other clusters
  • $\phi_i / \phi_{max}$ is small – the cluster has a low quality compared to a perfect cluster
  • The cluster is very similar to another cluster
Our New Algorithm
Medoid drawing/replacement (cont'd):
– Each time, only one potential bad medoid is replaced, since the probability of simultaneously correcting multiple medoids is low.
– The target bad medoid is replaced by a seed from the corresponding private seed group, or from a new public seed group.
– The medoids of the other clusters are replaced by their cluster medians, and their relevant dimensions are reselected.
– The algorithm keeps track of the best set of medoids and relevant dimensions found so far.
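One way to express the "likely bad" test for the first two criteria, as a sketch; the thresholds are placeholders, not values from the slides:

```python
def likely_bad_medoids(phi, phi_perfect, rel_threshold=0.5, abs_threshold=0.5):
    """phi: list of score components, one per cluster.
    phi_perfect: the score component a perfect cluster would achieve."""
    best = max(phi)
    flags = []
    for phi_i in phi:
        low_vs_others = (phi_i / best) < rel_threshold if best > 0 else True
        low_vs_perfect = (phi_i / phi_perfect) < abs_threshold
        flags.append(low_vs_others or low_vs_perfect)
    return flags
```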
Experimental Results
Dataset 1: n=1000, d=100, k=5, l_real=5-40 (5-40% of d)
No external inputs
Algorithms:
– HARP
– PROCLUS
– SSPC
– CLARANS (non-projected control)
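The plots that follow report the Adjusted Rand Index (ARI) between the produced and the true clusterings; for reference, it can be computed with scikit-learn (a generic illustration, not the evaluation code used in these experiments):

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
found_labels = [1, 1, 0, 0, 2, 2]     # label permutations do not matter
print(adjusted_rand_score(true_labels, found_labels))   # 1.0
```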
Experimental Results
Best performance (results with the best ARI values):
[Plot omitted: ARI vs. l_real (0 to 40) for CLARANS, HARP, PROCLUS (best), SSPC m (best) and SSPC p (best).]
Experimental Results
Best performance vs. average performance:
[Plot omitted: ARI vs. l_real comparing average and best runs of PROCLUS, SSPC m and SSPC p.]
Experimental Results
Robustness (l_real=10):
[Plots omitted: ARI vs. parameter value for SSPC m (0.3 to 0.7), SSPC p (0.01 to 0.09) and PROCLUS (10 to 90).]
Experimental Results
Dataset 2: n=150, d=3000, k=5, l_real=30 (1% of d)
Inputs:
– Io size, Iv size = 1-9
– 4 combinations: both, labeled objects only, labeled relevant dimensions only, none
– Coverage: 1-5 clusters (20-100%)
Experimental Results
Increasing input size (100% coverage):
[Plot omitted: ARI vs. input size (0 to 9) for Io only, Iv only and both.]
Experimental Results
Increasing coverage (input size=3):
[Plot omitted: ARI vs. coverage (0% to 100%) for Io only, Iv only and both, with input size 3.]
Experimental Results
Increasing coverage (input size=6):
[Plot omitted: ARI vs. coverage (0% to 100%) for Io only, Iv only and both, with input size 6.]
Future Works and Extensions
Other required experiments:
– Biased inputs
– Multiple labeling methods for a single dataset
– Scalability
– Real data
– Imperfect data with artificial outliers and errors
– Searching for the best k
Future Works and Extensions
To be considered in the future:
– Other input types (e.g. must-links and cannot-links)
– Wrong/inconsistent inputs
– Pattern-based and range-based similarity
– Non-disjoint clusters
References
Projected clustering:
– HARP: A Hierarchical Algorithm with Automatic Relevant Attribute Selection for Projected Clustering (DB Seminar, 20 Sep 2002)
Semi-supervised clustering:
– The Semi-supervised Clustering Problem (DB Seminar, 2 Jan 2004)