TRANSCRIPT
DB Seminar Series: Semi-supervised Projected Clustering
By: Kevin Yip (4th May 2004)
Outline
– Introduction
  • Projected clustering
  • Semi-supervised clustering
– Our problem
– Our new algorithm
– Experimental results
– Future works and extensions
Projected Clustering
Where are the clusters?
Projected Clustering
Pattern-based projected cluster:
[Line chart omitted: Series1-Series4 plotted across dimensions 1-5, illustrating a pattern-based projected cluster.]
Projected Clustering
Goal: to discover clusters and their relevant dimensions that optimize a certain objective function.
Previous approaches:
– Partitional: PROCLUS, ORCLUS
– One cluster at a time: DOC, FastDOC, MineClus
– Hierarchical: HARP
Projected Clustering
Limitations of these approaches:
– Cannot detect clusters of extremely low dimensionality (clusters with a low percentage of relevant dimensions, e.g. only 5% of the input dimensions are relevant)
– Require the input of parameter values that are hard for users to supply
– Performance is sensitive to the parameter values
– High time complexity
Semi-supervised Clustering
In some applications, a small amount of domain knowledge is usually available (e.g. the functions of 5% of the genes probed on a microarray).
This knowledge may not be suitable or sufficient for carrying out classification.
Clustering algorithms make little use of such external knowledge.
Semi-supervised Clustering
The idea of semi-supervised clustering:
– Use the models implicitly assumed behind a clustering algorithm (e.g. the compact hyperspheres of k-means, the density-connected irregular regions of DBScan)
– Use external knowledge to guide the tuning of the model parameters (e.g. the locations of cluster centers)
Semi-supervised Clustering
Why not clustering?
– The clusters produced may not be the ones required.
– There could be multiple possible groupings.
– There is no way to utilize the domain knowledge that is accessible (active learning vs. passive validation).
(Guha et al., 1998)
Semi-supervised Clustering
Why not classification?
– There is insufficient labeled data:
  • Objects are not labeled.
  • The amount of labeled objects is statistically insignificant.
  • The labeled objects do not cover all classes.
  • The labeled objects of a class do not cover all cases (e.g. they are all found at one side of the class).
– It is not always possible to find a classification method whose underlying model fits the data (e.g. pattern-based similarity).
Our Problem
Data model:
– The input dataset has n objects and d dimensions.
– The dataset contains k disjoint clusters, and possibly some outlier objects.
– Each cluster is associated with a set of relevant dimensions.
– If a dimension is relevant to a cluster, the projections of the cluster members on that dimension are random samples of a local Gaussian distribution.
– Other projections are random samples of a global distribution (e.g. a uniform distribution, or a Gaussian distribution with a standard deviation much larger than those of the local distributions).
Our Problem
Resulting data: if a dimension is relevant to a cluster, the projections of its members on that dimension will be close to each other (the within-cluster variance is much smaller than on irrelevant dimensions).
Example:
        X         Y         Z
C1   N(5, 1)   N(5, 1)   N(5, 1)
C2   N(8, 1)   N(6, 1)   U(0, 10)
C3   U(0, 10)  U(0, 10)  U(0, 10)
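For concreteness, a minimal NumPy sketch that draws data following this example table (the per-cluster size of 100 is a hypothetical choice, not stated on the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_cluster = 100  # hypothetical cluster size, not specified on the slides

# C1: X, Y, Z all relevant (local Gaussians)
c1 = rng.normal(loc=[5, 5, 5], scale=1, size=(n_per_cluster, 3))

# C2: X and Y relevant, Z irrelevant (drawn from the global uniform distribution)
c2 = np.column_stack([
    rng.normal(8, 1, n_per_cluster),
    rng.normal(6, 1, n_per_cluster),
    rng.uniform(0, 10, n_per_cluster),
])

# C3: no relevant dimensions, all projections from the global distribution
c3 = rng.uniform(0, 10, size=(n_per_cluster, 3))

data = np.vstack([c1, c2, c3])
labels = np.repeat([0, 1, 2], n_per_cluster)
```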
Our Problem
Problem definition:
– Inputs:
  • The dataset D
  • The target number of clusters k
  • A (possibly empty) set Io of labeled objects (object ID, class label), which may or may not cover all classes
  • A (possibly empty) set Iv of labeled relevant dimensions (dimension ID, class label), which may or may not cover all classes; a single dimension can be specified as relevant to multiple clusters
Our Problem
Problem definition (cont'd):
– Outputs:
  • A set of k disjoint projected clusters with a (locally) optimal objective score
  • A (possibly empty) set of outlier objects
Our Problem
Assumptions made in this study:
– There is a primary clustering target (cf. biclustering)
– Disjoint, axis-parallel clusters (cf. subspace clustering and ORCLUS)
– Distance-based similarity
– One cluster per class (cf. decision trees)
– All inputs are correct, but they can be biased (i.e. with projections on the relevant dimensions deviating from the cluster center)
Our New Algorithm
Basic idea: k-medoid/median
1. Determine the potential medoids (seeds) and the relevant dimensions of each cluster.
2. Assign every object to the cluster (or to the outlier list) that gives the greatest improvement to the objective score.
3. Decide which medoids are good/bad:
   – A good medoid: replace it by the cluster median and refine the selected dimensions.
   – A bad medoid: replace it by another seed.
4. Repeat steps 2 and 3 until no improvement is obtained within a certain number of iterations.
(A simplified sketch of this loop is given below.)
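A highly simplified, runnable sketch of the iterative loop above, assuming random seeds, dimension selection by a variance threshold m, and a plain projected-distance assignment; the objective-driven assignment, the outlier list and the seed groups of the actual algorithm are omitted:

```python
import numpy as np

def select_dimensions(members, global_var, m=0.5):
    """Select dimensions whose within-cluster variance is below m * global variance."""
    local_var = members.var(axis=0)
    return np.where(local_var < m * global_var)[0]

def projected_kmedoid(data, k, m=0.5, max_iter=20, seed=0):
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = data.shape
    global_var = data.var(axis=0)
    medoids = data[rng.choice(n, size=k, replace=False)]  # random seeds (no inputs)
    dims = [np.arange(d)] * k                             # start with all dimensions

    for _ in range(max_iter):
        # Step 2 (simplified): assign each object to the closest medoid,
        # measured only on that cluster's selected dimensions
        dists = np.stack([
            ((data[:, dims[i]] - medoids[i][dims[i]]) ** 2).mean(axis=1)
            for i in range(k)
        ])
        assign = dists.argmin(axis=0)

        # Step 3 (simplified): treat every medoid as "good", replace it by the
        # cluster median and reselect the relevant dimensions
        new_medoids = medoids.copy()
        for i in range(k):
            members = data[assign == i]
            if len(members) == 0:
                continue
            new_medoids[i] = np.median(members, axis=0)
            dims[i] = select_dimensions(members, global_var, m)
            if len(dims[i]) == 0:
                dims[i] = np.arange(d)
        if np.allclose(new_medoids, medoids):
            break
        medoids = new_medoids
    return assign, medoids, dims
```

For instance, `assign, medoids, dims = projected_kmedoid(data, k=3, m=0.5)` can be run on the synthetic example generated earlier.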
Our New Algorithm
Issues to consider:
– Design of the objective function
– Selection of relevant dimensions for a cluster
– Determination of seeds and the relevant dimensions of the corresponding potential clusters
– Replacement of medoids
Our New Algorithm
Design goals of the objective function:
– Should not have a trivial best score (e.g. when each cluster selects only one dimension)
– Should not be ruined by the selection of a small number of irrelevant dimensions
– Should be robust (the clustering accuracy should not degrade seriously when the input parameter values are not very accurate)
Our New Algorithm
The objective function:
– Overall score: $\phi = \frac{1}{nd}\sum_{i=1}^{k} n_i\,\phi_i$
– Score component of cluster $C_i$: $\phi_i = \sum_{v_j \in V_i} \phi_{ij}$ ($n_i$: number of members of $C_i$; $V_i$: set of selected dimensions of $C_i$)
– Contribution of selected dimension $v_j$ to the score component of $C_i$: $\phi_{ij} = \left(1 - \frac{1}{n_i}\right)\left(1 - \frac{\hat{\sigma}_{ij}^2}{\tilde{\sigma}_{ij}^2}\right)$, where $\hat{\sigma}_{ij}^2 = \frac{1}{n_i - 1}\sum_{x \in C_i}(x_j - \hat{\mu}_{ij})^2$ is the variance of the projections of $C_i$'s members on $v_j$
– $\tilde{\sigma}_{ij}^2$: normalization factor
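A small NumPy sketch of computing this score under Scheme 1 below (normalization factor m times the global variance); treat it as an illustration of the formula written above rather than the exact published definition:

```python
import numpy as np

def objective_score(data, assign, dims, m=0.5):
    """Sum over clusters of the size-weighted contributions of their
    selected dimensions, normalized by n * d (see the formula above)."""
    n, d = data.shape
    global_var = data.var(axis=0)
    sigma_tilde = m * global_var          # normalization factor (Scheme 1)
    total = 0.0
    for i, vi in enumerate(dims):         # vi: selected dimensions of cluster i
        members = data[assign == i]
        n_i = len(members)
        if n_i < 2 or len(vi) == 0:
            continue
        local_var = members[:, vi].var(axis=0, ddof=1)
        phi_ij = (1 - 1 / n_i) * (1 - local_var / sigma_tilde[vi])
        total += n_i * phi_ij.sum()
    return total / (n * d)
```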
Our New Algorithm
Characteristics of the objective function:
– A higher score means a better clustering.
– There is no trivial best score when each cluster selects only one dimension or selects all dimensions.
– Relevant dimensions (dimensions with a smaller $\hat{\sigma}_{ij}^2$) contribute more to the objective score.
– Robust? (To be discussed soon…)
Our New Algorithm
Dimension selection:
– In order to maximize $\phi_i$, all dimensions with $\hat{\sigma}_{ij}^2 < \tilde{\sigma}_{ij}^2$ should be selected.
– Appropriate values of $\tilde{\sigma}_{ij}^2$:
  • Should not exceed $\hat{\sigma}_j^2$, the global variance of dimension $v_j$ (otherwise even irrelevant dimensions would tend to be selected)
  • Scheme 1: $\tilde{\sigma}_{ij}^2 = m\,\hat{\sigma}_j^2$, where $0 < m \le 1$
  • Scheme 1b: $\tilde{\sigma}_{ij}^2 = \hat{\sigma}_j^2$, but only dimensions with $\hat{\sigma}_{ij}^2 < m\,\hat{\sigma}_j^2$ are selected => easier to compare the results obtained with different m
Our New Algorithm
Scheme 2: estimate the probability for an irrelevant dimension to be selected (the global distribution needs to be known). If the global distribution is Gaussian…
– If $n_i$ values are randomly sampled from the global distribution of an irrelevant dimension $v_j$, the random variable $(n_i - 1)\hat{\sigma}_{ij}^2 / \sigma_j^2$ has a chi-square distribution with $n_i - 1$ degrees of freedom.
– Suppose we want the probability of selecting an irrelevant dimension to be $p$, i.e. $p = \Pr(\hat{\sigma}_{ij}^2 \le \tilde{\sigma}_{ij}^2)$. From the cumulative chi-square distribution, the corresponding $\tilde{\sigma}_{ij}^2$ can be computed.
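With SciPy, the corresponding threshold can be obtained from the chi-square quantile function; a sketch (the function name and the choice p = 0.1 are mine):

```python
from scipy.stats import chi2

def variance_threshold(n_i, global_var, p=0.1):
    """Return sigma_tilde^2 such that an irrelevant (globally Gaussian) dimension
    is selected with probability p, i.e.
    Pr((n_i - 1) * local_var / global_var <= chi2.ppf(p, n_i - 1)) = p."""
    return chi2.ppf(p, df=n_i - 1) * global_var / (n_i - 1)

# Roughly matching the slides' n_i = 30 case:
print(variance_threshold(30, 1.0, p=0.1))   # about 0.68 (the slides read ~0.66 off the plot)
```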
Our New Algorithm
Probability density function and cumulative distribution of $(n_i - 1)\hat{\sigma}_{ij}^2 / \sigma_j^2$ ($n_i = 30$):
[Plots omitted. Reading off the cumulative distribution: $(30 - 1)\hat{\sigma}_{ij}^2 / \sigma_j^2 \le 19 \Rightarrow \hat{\sigma}_{ij}^2 \le 0.66\,\sigma_j^2$ (i.e. $\tilde{\sigma}_{ij}^2 = m\,\sigma_j^2$ with $m \approx 0.66$).]
Our New Algorithm
Robustness of the algorithm:
– A good value of m should be…
  • Large enough to tolerate local variances
  • Small enough to distinguish local variances from global variances
– The best value to use is data-dependent, but provided the difference between local and global variances is large, there is usually a wide range of values that lead to results with acceptable performance (e.g. 0.3 < m < 0.7).
Our New Algorithm
Determination of seeds and the relevant dimensions of the corresponding potential clusters:
– Traditional approach:
  • One seed pool
  • Seeds determined randomly / by the max-min distance method / by preclustering (e.g. hierarchical)
  • Relevant dimensions of each cluster determined by a set of objects near the medoid (in the input space)
Our New Algorithm
Our proposal – seed groups:
– Seeds are stored in separate seed groups; each seed group contains a small number (e.g. 5) of seeds.
– There is one private seed group for each cluster with some inputs.
– A number of public seed groups are shared by all clusters without external inputs.
– The seeds of the cluster with the largest amount of inputs are initialized first (as we are most confident in their correctness), then those of clusters with fewer inputs, and so on. Finally, the public seed groups are initialized.
Our New Algorithm
Our proposal – seed selection:
– Based on low-dimensional histograms (grids)
– Relevant dimension => small variance => high density
– Procedure:
  • Determine a starting point
  • Hill-climbing
– => Need to determine both the dimensions used in constructing the grid and the starting point (a sketch of the hill-climbing is given below)
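An illustrative sketch of the histogram hill-climbing; the grid resolution and the neighbourhood definition are my own choices, not specified on the slides:

```python
import numpy as np
from itertools import product

def hill_climb_seed(data, grid_dims, start, bins=10):
    """Hill-climb on a histogram built over `grid_dims` and return a dense cell.
    `start` is an object (a row of `data`) used as the starting point."""
    sub = data[:, grid_dims]
    hist, edges = np.histogramdd(sub, bins=bins)
    # locate the starting cell
    cell = tuple(
        min(np.searchsorted(edges[a], start[dim], side="right") - 1, bins - 1)
        for a, dim in enumerate(grid_dims)
    )
    while True:
        # examine all neighbouring cells (including diagonals and the cell itself)
        neighbours = [
            tuple(np.clip(np.array(cell) + np.array(step), 0, bins - 1))
            for step in product((-1, 0, 1), repeat=len(grid_dims))
        ]
        best = max(neighbours, key=lambda c: hist[c])
        if hist[best] <= hist[cell]:
            return cell, hist[cell]          # local density maximum reached
        cell = best
```

The objects falling into the returned cell would then form the seed cluster from which a medoid candidate is taken.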
Our New Algorithm
Determining the grid-constructing dimensions and the starting point:
– Case 1: a cluster with both labeled objects and labeled relevant dimensions
– Case 2: a cluster with only labeled objects
– Case 3: a cluster with only labeled relevant dimensions
– Case 4: a cluster with no inputs
Our New Algorithm
Case 1: both kinds of inputs are available
1. Form a seed cluster from the input objects.
2. Rank all dimensions by $\phi_{ij} = \left(1 - \frac{1}{n_i}\right)\left(1 - \frac{\hat{\sigma}_{ij}^2}{\tilde{\sigma}_{ij}^2}\right)$.
3. All dimensions with a positive $\phi_{ij}$, or in the input set Iv, are candidate dimensions for constructing the histograms.
4. Relative chance of being selected: $\phi_{ij}$ if dimension $v_j$ is not in Iv, 1 otherwise.
5. The starting point is the median of the seed cluster.
(A sketch of the weighted dimension sampling in steps 3 and 4 is given below.)
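A sketch of steps 3 and 4, sampling, say, two grid-constructing dimensions with the stated weights; the helper name and the choice of two dimensions are mine:

```python
import numpy as np

def sample_grid_dims(phi, labeled_dims, n_grid_dims=2, seed=0):
    """phi: array of phi_ij values for every dimension of the seed cluster.
    labeled_dims: dimension indices given in Iv for this cluster."""
    rng = np.random.default_rng(seed)
    # weight 1 for dimensions in Iv, phi_ij otherwise
    weights = np.where(np.isin(np.arange(len(phi)), labeled_dims), 1.0, phi)
    candidates = np.where(weights > 0)[0]        # positive phi_ij or in Iv
    w = weights[candidates]
    n_pick = min(n_grid_dims, len(candidates))
    return rng.choice(candidates, size=n_pick, replace=False, p=w / w.sum())
```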
Our New Algorithm
Example: cluster 2
– $\phi_{2,x}$ = 0.68
– $\phi_{2,y}$ = 0.83
– $\phi_{2,z}$ = -0.02
The hill-climbing mechanism fixes errors due to biased inputs.
Our New Algorithm
Case 2: labeled objects only
– Similar to case 1, but the chance for each dimension to be selected is based on $\phi_{ij}$ only.
Case 3: labeled dimensions only
– Similar to case 1, but with no starting point, i.e. all cells are examined and the one with the highest density is returned.
Our New Algorithm
Case 4: no inputs
– The tentative seed is the object with the maximum projected distance to the closest already-selected seeds (a modified max-min distance method).
– For each dimension, a one-dimensional histogram is constructed to determine the density of objects around the projection of the tentative seed.
– The chance of each dimension being selected to construct the grid is based on this density.
– The tentative seed is used as the starting point.
(A sketch of the max-min step is given below.)
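A sketch of the max-min step; here a plain Euclidean distance over all dimensions is used, whereas the slides call for a projected distance on the selected dimensions:

```python
import numpy as np

def next_tentative_seed(data, chosen_idx):
    """Return the index of the object farthest from its closest chosen seed."""
    chosen = data[chosen_idx]                                        # (c, d)
    # squared distance of every object to every already-chosen seed
    d2 = ((data[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2)  # (n, c)
    nearest = d2.min(axis=1)
    nearest[chosen_idx] = -np.inf        # never re-pick an existing seed
    return int(nearest.argmax())
```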
Our New Algorithm
Medoid drawing/replacement:
– The medoid of each cluster is initially drawn from…
  • The corresponding private seed group, if available
  • A unique public seed group, otherwise
– After assignment, the medoid of cluster $C_i$ is likely to be a bad one if…
  • $\phi_i / \max_{i'} \phi_{i'}$ is small – the cluster has a low quality compared to the other clusters
  • $\phi_i / \phi_{max}$ is small – the cluster has a low quality compared to a perfect cluster
  • The cluster is very similar to another cluster
Our New Algorithm
Medoid drawing/replacement (cont'd):
– Each time, only one potential bad medoid is replaced, since the probability of simultaneously correcting multiple medoids is low.
– The target bad medoid is replaced by a seed from the corresponding private seed group, or from a new public seed group.
– The medoids of the other clusters are replaced by their cluster medians, and their relevant dimensions are reselected.
– The algorithm keeps track of the best set of medoids and relevant dimensions found so far.
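One way to express the "likely bad" test for the first two criteria, as a sketch; the thresholds are placeholders, not values from the slides:

```python
def likely_bad_medoids(phi, phi_perfect, rel_threshold=0.5, abs_threshold=0.5):
    """phi: list of score components, one per cluster.
    phi_perfect: the score component a perfect cluster would achieve."""
    best = max(phi)
    flags = []
    for phi_i in phi:
        low_vs_others = (phi_i / best) < rel_threshold if best > 0 else True
        low_vs_perfect = (phi_i / phi_perfect) < abs_threshold
        flags.append(low_vs_others or low_vs_perfect)
    return flags
```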
Experimental Results
Dataset 1: n=1000, d=100, k=5, l_real=5-40 (5-40% of d)
No external inputs
Algorithms:
– HARP
– PROCLUS
– SSPC
– CLARANS (non-projected control)
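The plots that follow report the Adjusted Rand Index (ARI) between the produced and the true clusterings; for reference, it can be computed with scikit-learn (a generic illustration, not the evaluation code used in these experiments):

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
found_labels = [1, 1, 0, 0, 2, 2]     # label permutations do not matter
print(adjusted_rand_score(true_labels, found_labels))   # 1.0
```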
Experimental Results
Best performance (results with the best ARI values):
[Plot omitted: ARI vs. l_real (0 to 40) for CLARANS, HARP, PROCLUS (best), SSPC m (best) and SSPC p (best).]
Experimental Results
Best performance vs. average performance:
[Plot omitted: ARI vs. l_real comparing average and best runs of PROCLUS, SSPC m and SSPC p.]
Experimental Results
Robustness (l_real=10):
[Plots omitted: ARI vs. parameter value for SSPC m (0.3 to 0.7), SSPC p (0.01 to 0.09) and PROCLUS (10 to 90).]
Experimental Results
Dataset 2: n=150, d=3000, k=5, l_real=30 (1% of d)
Inputs:
– Io size, Iv size = 1-9
– 4 combinations: both, labeled objects only, labeled relevant dimensions only, none
– Coverage: 1-5 clusters (20-100%)
Experimental Results
Increasing input size (100% coverage):
[Plot omitted: ARI vs. input size (0 to 9) for Io only, Iv only and both.]
Experimental Results
Increasing coverage (input size=3):
[Plot omitted: ARI vs. coverage (0% to 100%) for Io only, Iv only and both, with input size 3.]
Experimental Results
Increasing coverage (input size=6):
[Plot omitted: ARI vs. coverage (0% to 100%) for Io only, Iv only and both, with input size 6.]
Future Works and Extensions
Other required experiments:
– Biased inputs
– Multiple labeling methods for a single dataset
– Scalability
– Real data
– Imperfect data with artificial outliers and errors
– Searching for the best k
Future Works and Extensions
To be considered in the future:
– Other input types (e.g. must-links and cannot-links)
– Wrong/inconsistent inputs
– Pattern-based and range-based similarity
– Non-disjoint clusters
References
Projected clustering:
– HARP: A Hierarchical Algorithm with Automatic Relevant Attribute Selection for Projected Clustering (DB Seminar, 20 Sep 2002)
Semi-supervised clustering:
– The Semi-supervised Clustering Problem (DB Seminar, 2 Jan 2004)