
Ch. Eick et al.: Using Clustering to Learn Distance Functions, MLDM 2005

Using Clustering to Learn Distance Functions for Supervised Similarity Assessment

Christoph F. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta Department of Computer Science

University of Houston

Organization of the Talk

1. Similarity Assessment

2. A Framework for Distance Function Learning

3. Inside/Outside Weight Updating

4. Distance Function Learning Research at UH-DMML

5. Experimental Evaluation

6. Other Distance Function Learning Research

7. Summary


1. Similarity Assessment

Definition: Similarity assessment is the task of determining which objects are similar to each other and which are dissimilar to each other.

Goal of Similarity Assessment: Construct a distance function!

Applications of Similarity Assessment:
• Case-based reasoning
• Classification techniques that rely on distance functions
• Clustering
• …

Complications:
• Usually, there is no universal “good” distance function for a set of objects; the usefulness of a distance function depends on the task it is used for (“no free lunch in similarity assessment either”).
• Defining the distance between objects is more an art than a science.


Motivating Example: How To Find Similar Patients?

The following relation is given (with 10000 tuples):
Patient(ssn, weight, height, cancer-sev, eye-color, age,…)

• Attribute Domains

– ssn: 9 digits

– weight: between 30 and 650; μ_weight=158, σ_weight=24.20

– height: between 0.30 and 2.20 in meters; μ_height=1.52, σ_height=19.2

– cancer-sev: 4=serious 3=quite_serious 2=medium 1=minor

– eye-color: {brown, blue, green, grey}

– age: between 3 and 100; μ_age=45, σ_age=13.2

Task: Define Patient Similarity


CAL-FULL/UH Database Clustering & Similarity Assessment Environments

[Diagram: components of the environment: DBMS, Data Extraction Tool, ObjectView, Clustering Tool, Similarity Measure Tool, Learning Tool, User Interface, Library of clustering algorithms, Library of similarity measures. Labeled data flows: Training Data, Similarity measure, A set of clusters, Default choices and domain information, Type and weight information. One annotation reads “Today’s topic”.]

For more details: see [RE05]


2. A Framework for Distance Function Learning

• Assumption: The distance between two objects is computed as the weighted sum of the distances with respect to their attributes.

• Objective: Learn a “good” distance function for classification tasks.

• Our approach: Apply a clustering algorithm, using the object distance function to be evaluated, that returns k clusters.

• Our goal is to learn the weights of an object distance function such that pure clusters are obtained (or clusters that are as pure as possible); a pure cluster contains examples belonging to a single class.

$\Theta(o_i, o_j) = \frac{\sum_{f=1}^{p} w_f \cdot \theta_f(o_i, o_j)}{\sum_{f=1}^{p} w_f}$
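As a rough illustration of the weighted-sum assumption, the following Python sketch computes an object distance as the weight-normalized sum of per-attribute distances. The per-attribute distance functions (range-normalized absolute difference for numeric attributes, 0/1 mismatch for nominal ones), the two patients, and all function names are assumptions made for this example; the slides do not prescribe them. The attribute ranges are taken from the patient relation of the motivating example.

```python
def object_distance(a, b, weights, attribute_distances):
    """Weighted sum of per-attribute distances, divided by the sum of the weights."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(
        w * dist(x, y)
        for w, dist, x, y in zip(weights, attribute_distances, a, b)
    )
    return weighted / total_weight


def numeric_distance(lo, hi):
    """Assumed attribute distance: absolute difference scaled by the domain range."""
    span = hi - lo
    return lambda x, y: abs(x - y) / span


def nominal_distance(x, y):
    """Assumed attribute distance for nominal values: 0 if equal, 1 otherwise."""
    return 0.0 if x == y else 1.0


# Hypothetical patients described by (weight, height in meters, eye-color).
attribute_distances = [
    numeric_distance(30, 650),      # weight domain from the example relation
    numeric_distance(0.30, 2.20),   # height domain from the example relation
    nominal_distance,               # eye-color
]
p1 = (80.0, 1.80, "brown")
p2 = (142.0, 1.61, "blue")

d_equal = object_distance(p1, p2, [1.0, 1.0, 1.0], attribute_distances)
d_no_eye = object_distance(p1, p2, [1.0, 1.0, 0.0], attribute_distances)
print(round(d_equal, 3), round(d_no_eye, 3))
```

Setting the eye-color weight to zero removes that attribute's contribution entirely, which is the kind of adjustment the weight learning is meant to find automatically.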


Idea: Coevolving Clusters and Distance Functions

[Figure: feedback loop with the components Clustering X, Distance Function, Clustering Evaluation q(X), Goodness of the Distance Function, and Weight Updating Scheme / Search Strategy. Two example clusterings of x and o examples are contrasted: one obtained with a “bad” distance function and one with a “good” distance function.]


3. Inside/Outside Weight Updating

Idea: Move examples of the majority class closer to each other

Cluster1: distances with respect to Att1: xo oo ox
Action: Increase weight of Att1

Cluster1: distances with respect to Att2: o o xx o o
Action: Decrease weight of Att2

o := examples belonging to the majority class
x := non-majority-class examples


Inside/Outside Weight Updating Algorithm

1. Cluster the dataset with k-means, using the given weight vector w=(w1,…,wp)

2. FOR EACH cluster-attribute pair DO

   1. Modify w using inside/outside weight updating

3. IF NOT DONE, CONTINUE with Step 1; OTHERWISE, RETURN w.
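The control flow of the three steps might be sketched in Python as follows. This is only an illustration under several assumptions: the tiny k-means variant, the normalized weighted absolute-difference distance, the fixed iteration count as the "done" test, and the `update` hook (a hypothetical callable that receives the current weight, the cluster's values for one attribute, and the cluster's class labels) are all choices made here, not details given on the slides.

```python
import random


def distance(a, b, w):
    """Assumed object distance: weighted absolute differences over the weight sum."""
    return sum(wi * abs(x - y) for wi, x, y in zip(w, a, b)) / sum(w)


def kmeans(points, k, w, rng, rounds=10):
    """Tiny k-means variant that uses the weighted distance for assignment."""
    centroids = rng.sample(points, k)

    def assign():
        return [
            min(range(k), key=lambda c: distance(p, centroids[c], w))
            for p in points
        ]

    for _ in range(rounds):
        labels = assign()
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ran empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign()


def learn_weights(points, classes, k, update, iterations=200, seed=0):
    """Step 1: cluster; Step 2: update w per cluster-attribute pair; Step 3: repeat."""
    rng = random.Random(seed)
    w = [1.0] * len(points[0])
    for _ in range(iterations):  # assumed stopping rule: fixed number of iterations
        cluster_ids = kmeans(points, k, w, rng)
        for c in range(k):
            idx = [i for i, cid in enumerate(cluster_ids) if cid == c]
            if len(idx) < 2:
                continue  # no pairwise distances to learn from
            for a in range(len(w)):
                values = [points[i][a] for i in idx]
                labels = [classes[i] for i in idx]
                w[a] = update(w[a], values, labels)
    return w


points = [(0.0, 0.0), (0.1, 0.9), (0.9, 0.1), (1.0, 1.0)]
classes = ["a", "a", "b", "b"]
unchanged = learn_weights(points, classes, 2, lambda w, v, l: w, iterations=3)
halved = learn_weights(points, classes, 2, lambda w, v, l: 0.5 * w, iterations=2)
```

The sketch does not guard against an `update` hook that drives weights to zero or below; a real implementation would need to handle that case.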


Inside/Outside Weight Updating Heuristic

The weight w_i of the i-th attribute is updated as follows for a given cluster:

$w_i' = w_i + \alpha \cdot w_i \cdot (\sigma_i - \mu_i)$    (W)

σ_i: average distance of all objects in the cluster with respect to attribute i

μ_i: average distance of the majority class objects in the cluster with respect to attribute i

α: learning rate (e.g. 0.3)

Example 1: o o xx o o    Example 2: xo oo ox
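A small Python sketch of the update for one cluster and one attribute is given here, assuming the rule w' = w + α·w·(σ − μ), where σ is the average distance among all objects of the cluster and μ the average distance among its majority-class objects. Interpreting "average distance" as the mean absolute difference over all pairs is an assumption of this example, as are the function names and the attribute values chosen to mimic the two o/x patterns.

```python
from itertools import combinations


def average_pairwise_distance(values):
    """Mean absolute difference over all pairs (assumed reading of 'average distance')."""
    pairs = list(combinations(values, 2))
    if not pairs:
        return 0.0
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def update_weight(w, attr_values, labels, alpha=0.3):
    """Apply w' = w + alpha * w * (sigma - mu) for one cluster and one attribute."""
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    majority = max(counts, key=counts.get)
    sigma = average_pairwise_distance(attr_values)
    mu = average_pairwise_distance(
        [v for v, lab in zip(attr_values, labels) if lab == majority]
    )
    return w + alpha * w * (sigma - mu)


# Pattern "o o xx o o": majority examples (o) are spread out, so the weight shrinks.
spread = update_weight(
    1.0, [0.0, 0.2, 0.45, 0.55, 0.8, 1.0], ["o", "o", "x", "x", "o", "o"]
)

# Pattern "xo oo ox": majority examples sit closer together than the cluster
# as a whole, so the weight grows.
compact = update_weight(
    1.0, [0.0, 0.1, 0.45, 0.55, 0.9, 1.0], ["x", "o", "o", "o", "o", "x"]
)
print(round(spread, 3), round(compact, 3))
```

With these hypothetical values, σ − μ is 0.46 − 0.60 for the first pattern and 0.50 − 0.4167 for the second, so the weight moves down in the first case and up in the second.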


Idea: Weight Inside/Outside Weight Updating

[Figure: the six objects (1 to 6) of Cluster k, shown with respect to Attribute1, Attribute2, and Attribute3]

Initial Weights: w1=w2=w3=1; Updated Weights: w1=1.14, w2=1.32, w3=0.84


Illustration: Net Effect of Weight Adjustments

[Figure: the six objects (1 to 6) of Cluster k, shown under the old object distances and under the new object distances]


A Slightly Enhanced Weight Update Formula

$w_i' = w_i + \alpha \cdot \lambda \cdot w_i \cdot (\sigma_i - \mu_i)$    (W')

σ_i: average distance of all objects in the cluster with respect to attribute i

μ_i: average distance of the majority class objects in the cluster with respect to attribute i

α: learning rate (e.g. 0.3)

λ: the cluster size over the average cluster size


Sample Run of IOWU for the Diabetes Dataset


4. Distance Function Learning Research at UH-DMML

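[Diagram: overview of distance function learning approaches, organized along two dimensions]

Weight-Updating Scheme / Search Strategy: Randomized Hill Climbing, Adaptive Clustering, Inside/Outside Weight Updating, …

Distance Function Evaluation: K-Means, Supervised Clustering, NN-Classifier, …

Annotations in the diagram: Work by Karypis; Other Research; Current Research [EZZ04]; [BECV05]; [ERBV04]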


5. Experimental Evaluation

• Used a benchmark consisting of 7/15 UCI datasets

• Inside/outside weight updating was run for 200 iterations

• The learning rate α was set to 0.3

• Evaluation (10-fold cross validation repeated 10 times was used to determine accuracy)

– Used a 1-NN classifier as the baseline classifier

– Used the learned distance function for a 1-NN classifier

– Used the learned distance function for an NCC classifier (new!)
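To make the evaluation protocol concrete, the following Python sketch estimates the accuracy of a 1-NN classifier that uses a weighted distance, with repeated k-fold cross validation. The distance definition, the fold construction, the function names, and the toy data are assumptions for illustration; the actual experiments used UCI datasets and the learned weights.

```python
import random


def weighted_distance(a, b, weights):
    """Assumed object distance: weighted absolute differences over the weight sum."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b)) / sum(weights)


def one_nn(x, train_points, train_classes, weights):
    """Return the class of the training example closest to x."""
    best = min(
        range(len(train_points)),
        key=lambda i: weighted_distance(x, train_points[i], weights),
    )
    return train_classes[best]


def cross_validated_accuracy(points, classes, weights, folds=10, repeats=10, seed=0):
    """Accuracy of weighted 1-NN under repeated k-fold cross validation."""
    rng = random.Random(seed)
    correct = total = 0
    for _ in range(repeats):
        order = list(range(len(points)))
        rng.shuffle(order)
        for f in range(folds):
            test_idx = order[f::folds]  # every folds-th example forms one fold
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            train_points = [points[i] for i in train_idx]
            train_classes = [classes[i] for i in train_idx]
            for i in test_idx:
                predicted = one_nn(points[i], train_points, train_classes, weights)
                correct += predicted == classes[i]
                total += 1
    return correct / total


# Toy data: attribute 0 separates the classes, attribute 1 is noise.
points = [(0.0, 0.0), (0.1, 1.0), (0.05, 0.5), (0.02, 0.9),
          (1.0, 0.05), (0.9, 0.95), (0.95, 0.5), (0.98, 0.1)]
classes = ["a", "a", "a", "a", "b", "b", "b", "b"]

acc_good = cross_validated_accuracy(points, classes, [1.0, 0.0], folds=4, repeats=3)
acc_noise = cross_validated_accuracy(points, classes, [0.0, 1.0], folds=4, repeats=3)
```

On this toy data, putting all weight on the separating attribute classifies every held-out example correctly, while weighting only the noise attribute gives no such guarantee.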


NCC-Classifier

[Figure: two panels over Attribute1 and Attribute2, with labels A to F. (a) Dataset clustered by K-means. (b) Dataset edited using cluster centroids that carry the class label of the cluster majority class.]

Idea: the training set is replaced by k (centroid, majority class) pairs that are computed using k-means; the dataset generated in this way is then used to classify the examples in the test set.
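The NCC idea can be sketched in Python along these lines, assuming the cluster assignment has already been produced (for example by k-means with the learned distance function). The function names, the weighted absolute-difference distance, and the toy data are assumptions made for this illustration.

```python
from collections import Counter


def build_prototypes(points, classes, cluster_ids):
    """Replace each cluster by a (centroid, majority class) pair."""
    groups = {}
    for p, c, cid in zip(points, classes, cluster_ids):
        groups.setdefault(cid, []).append((p, c))
    prototypes = []
    for members in groups.values():
        pts = [p for p, _ in members]
        centroid = tuple(sum(col) / len(pts) for col in zip(*pts))
        majority = Counter(c for _, c in members).most_common(1)[0][0]
        prototypes.append((centroid, majority))
    return prototypes


def ncc_classify(x, prototypes, weights):
    """Assign x the class label carried by the nearest centroid."""
    def dist(a, b):
        # Assumed distance: weighted absolute differences over the weight sum.
        return sum(w * abs(u - v) for w, u, v in zip(weights, a, b)) / sum(weights)

    return min(prototypes, key=lambda proto: dist(x, proto[0]))[1]


# Toy training set with a given cluster assignment (hypothetical values).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
classes = ["a", "a", "b", "b", "b", "a"]
cluster_ids = [0, 0, 0, 1, 1, 1]

prototypes = build_prototypes(points, classes, cluster_ids)
label_near_origin = ncc_classify((0.5, 0.5), prototypes, [1.0, 1.0])
label_far = ncc_classify((5, 5), prototypes, [1.0, 1.0])
```

Because only k prototypes are kept, classification needs k distance computations per test example instead of one per training example, and minority examples inside a cluster no longer influence the prediction.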


Experimental Evaluation

Dataset         n    k   1-NN   LW1NN  NCC    C4.5
DIABETES        768  35  70.62  68.89  73.07  74.49
VEHICLE         846  64  69.59  69.86  65.94  72.28
HEART-STATLOG   270  10  76.15  77.52  81.07  78.15
GLASS           214  30  69.95  73.5   66.41  67.71
HEART-C         303  25  76.06  76.39  78.77  76.94
HEART-H         294  25  78.33  77.55  81.54  80.22
IONOSPHERE      351  10  87.1   91.73  86.73  89.74

Remark: Statistically significant improvements are in red.


DF-Learning With Randomized Hill Climbing

• Generate R solutions in the neighborhood of w and pick the best one to be the new weight vector w

$w_i' = w_i \cdot (1 + \delta \cdot Random(0,1))$

Random: random number

δ: rate of change, for example: [-0.3, 0.3]
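One step of this procedure could look roughly like the Python sketch below. It assumes that each weight is multiplied by (1 + u), with u drawn uniformly from [-rate, rate] (matching the example interval [-0.3, 0.3]), and that lower fitness values are better; both points, as well as the function names and the toy fitness function, are assumptions of this example rather than details confirmed by the slides.

```python
import random


def neighbors(w, rate, r, rng):
    """Sample r weight vectors by perturbing each weight by up to +/- rate."""
    return [
        [wi * (1.0 + rng.uniform(-rate, rate)) for wi in w]
        for _ in range(r)
    ]


def hill_climb_step(w, fitness, rate=0.3, r=10, rng=None):
    """Generate r neighbors of w and return the one with the best (lowest) fitness."""
    rng = rng or random.Random()
    return min(neighbors(w, rate, r, rng), key=fitness)


def toy_fitness(w):
    """Hypothetical stand-in for a clustering-based evaluation of the weights."""
    return abs(w[0] - 2.0)


new_w = hill_climb_step([1.0, 1.0], toy_fitness, rate=0.3, r=20, rng=random.Random(1))
same_w = hill_climb_step([1.0, 1.0], toy_fitness, rate=0.3, r=20, rng=random.Random(1))
```

In the setting of the talk, the fitness call would cluster the data with the candidate weights and evaluate the resulting clusters, which is why each step is expensive.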


Accuracy IOWA and Randomized Hill Climbing

Dataset                    RHC(1c)  RHC(2c)  RHC(5c)  IOWA(1c)  IOWA(2c)  IOWA(5c)
autos                      48.21    46.66    38.32    40.94     45.70     41.39
breast-cancer              70.09    73.05    71.04    71.85     73.21     71.49
wisconsin-breast-cancer    94.47    96.24    95.06    94.41     96.67     94.03
credit-rating              53.17    47.17    44.59    53.28     49.14     45.88
pima_diabetes              71.56    73.91    73.24    72.11     73.80     74.22
german_credit              69.50    71.31    72.48    67.41     68.89     70.47
Glass                      61.24    64.56    62.32    61.16     63.38     61.41
cleveland-14-heart-diseas  77.89    74.87    71.20    77.33     73.39     67.30
hungarian-14-heart-diseas  80.94    80.09    78.45    79.77     79.62     76.78
heart-statlog              82.33    81.67    76.37    82.15     81.78     77.52
ionosphere                 82.74    85.75    86.17    85.25     89.72     89.57
sonar                      70.70    71.97    73.68    71.70     72.67     73.43
vehicle                    56.25    56.25    58.31    53.51     56.36     55.48
vote                       94.67    90.54    88.84    93.68     94.21     89.05
zoo                        78.97    67.19    56.11    79.20     68.75     53.80


Distance Function Learning With Adaptive Clustering

• Uses reinforcement learning to adapt distance functions for k-means clustering.

• Employs search strategies that explore multiple paths in parallel. The algorithm maintains an open list with maximum size |L|; bad performers are dropped from the open list. Currently, beam search is used, which creates 2p successors for each open-list entry (increasing and decreasing the weight of each attribute exactly once), evaluates those 2p*|L| successors, and keeps the best |L| of them.

• Discretizes the search space, in which states are (<weights>,<centroids>) tuples, into a grid, and memorizes and updates the fitness values of the grid; value iteration is limited to “interesting states” by employing prioritized sweeping.

• Weights are updated by increasing / decreasing the weight of an attribute by a randomly chosen percentage that falls within an interval [min-change, max-change]; our current implementation uses [25%,50%].

• Employs entropy H(X) as the fitness function (low entropy indicates pure clusters).
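The beam search step described above (2p successors per open-list entry, keep the best |L|) might be sketched in Python as follows. To keep the example deterministic, it uses a fixed change of 25% instead of a percentage drawn at random from [25%, 50%]; that simplification, the function names, the toy fitness function, and the convention that lower fitness is better are assumptions of this sketch.

```python
def successors(w, change):
    """Create 2p successors: each attribute weight increased and decreased once."""
    result = []
    for i in range(len(w)):
        for factor in (1.0 + change, 1.0 - change):
            s = list(w)
            s[i] = w[i] * factor
            result.append(s)
    return result


def beam_step(open_list, fitness, beam_width, change=0.25):
    """Expand every entry of the open list and keep the best beam_width successors."""
    candidates = [s for w in open_list for s in successors(w, change)]
    return sorted(candidates, key=fitness)[:beam_width]


def toy_fitness(w):
    """Hypothetical stand-in for the entropy H(X) of the resulting clustering."""
    return abs(w[0] - w[1])


expanded = successors([1.0, 2.0], 0.25)
new_open_list = beam_step([[1.0, 2.0], [2.0, 2.0]], toy_fitness, beam_width=2)
```

With p = 2 attributes and |L| = 2 open-list entries, the step evaluates 2p*|L| = 8 successors and retains two of them.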


6. Related Distance Function Learning Research

• Interactive approaches that use user feedback and reinforcement learning to derive a good distance function.

• Other work uses randomized hill climbing and neural networks to learn distance functions for classification tasks; mostly, NN-queries are used to evaluate the quality of a clustering.

• Other work, mostly in the area of semi-supervised clustering, adapts object distances to cope with constraints.


7. Summary

• Described an approach that employs clustering for distance function evaluation.

• Introduced an attribute weight updating heuristic called inside/outside weight-updating and evaluated its performance.

• The inside/outside weight updating approach enhanced a 1-NN classifier significantly for some UCI datasets, but not for all datasets that were tested.

• The quality of the employed approach depends on the number of clusters k, which is an input parameter; our current research centers on determining k automatically with a supervised clustering algorithm [EZZ04].

• The general idea of replacing a dataset by cluster representatives to enhance NN-classifiers shows a lot of promise in this research (as exemplified by the NCC classifier) and in other research we are currently conducting.

• Distance function learning is quite time consuming; one run of 200 iterations of inside/outside weight updating takes between 5 seconds and 5 minutes, depending on dataset size and k-value. Other techniques we currently investigate are significantly slower; therefore, we are currently moving to high performance computing facilities for the empirical evaluation of the distance function learning approaches.


Links to 4 Papers

1. [EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version appeared in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf

2. [RE05] T. Ryu and C. Eick, A Clustering Methodology and Tool, in Information Sciences 171(1-3): 29-59 (2005). http://www.cs.uh.edu/~ceick/kdd/RE05.doc

3. [ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, in Proc. MLDM'05, Leipzig, Germany, July 2005. http://www.cs.uh.edu/~ceick/kdd/ERBV05.pdf

4. [BECV05] A. Bagherjeiran, C. Eick, C.-S. Chen, R. Vilalta, Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience, submitted for publication. http://www.cs.uh.edu/~ceick/kdd/BECV05.pdf


Questions?


Randomized Hill Climbing

• Fast start: The algorithm starts with a small neighborhood size and keeps it until it cannot find any better solutions. Then it triples its neighborhood size, hoping that a better solution can be found by trying more points.

• Shoulder condition: When the algorithm has moved to a shoulder or flat hill, it will keep getting solutions with the same fitness value. Our algorithm terminates when it has tried 3 times and is still getting the same results. This prevents it from being trapped on a shoulder forever.
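The two conditions can be combined into a loop such as the following Python sketch. Several details are assumptions made for illustration: "neighborhood size" is read as the number of sampled neighbors, the upper bound `max_size` and the cap `max_steps` are added so that the loop is guaranteed to stop, lower fitness counts as better, and the toy fitness functions stand in for a clustering-based evaluation.

```python
import random


def randomized_hill_climbing(w, fitness, rate=0.3, start_size=5, max_size=135,
                             max_flat=3, max_steps=1000, seed=0):
    """Hill climbing with a growing neighborhood and a shoulder stopping rule."""
    rng = random.Random(seed)
    current, current_fit = list(w), fitness(w)
    size, flat = start_size, 0
    for _ in range(max_steps):
        candidates = [
            [wi * (1.0 + rng.uniform(-rate, rate)) for wi in current]
            for _ in range(size)
        ]
        best = min(candidates, key=fitness)
        best_fit = fitness(best)
        if best_fit < current_fit:
            # Improvement: move there and reset the shoulder counter.
            current, current_fit, flat = best, best_fit, 0
        elif best_fit == current_fit:
            # Shoulder condition: same fitness again; give up after max_flat tries.
            flat += 1
            if flat >= max_flat:
                break
            current = best
        else:
            # Fast start: no better solution found, so triple the neighborhood size.
            if size >= max_size:
                break
            size *= 3
    return current, current_fit


start = [1.0]
result_w, result_fit = randomized_hill_climbing(start, lambda w: abs(w[0] - 2.0))
flat_w, flat_fit = randomized_hill_climbing(start, lambda w: 0.0)
```

On the constant fitness function the search stops after three tries with identical results, which is the behavior the shoulder condition is meant to enforce.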


Randomized Hill Climbing

[Figure: objective function plotted over the state space, with a shoulder and a flat hill marked]


Purity in clusters obtained (internal)

Test 2.2 (Beta=0.4)

Inside outside weight updating (Repeat 200 times)

SCEC parameters PS=200, n=30

Learning Rate (%)  Diabetes  Vehicle  HeartStatlog  Glass    Heart-C  Heart-H  IONOSPHERE
10                 0.23177   0.35142  0.13333       0.24230  0.33003  0.14286  0.11252
35                 0.22135   0.33870  0.14074       0.23832  0.33003  0.14625  0.08717
50                 0.21354   0.36213  0.14074       0.26099  0.33333  0.13265  0.08717
70                 0.21745   0.35545  0.14074       0.23871  0.33333  0.13605  0.08717


Purity in clusters obtained (internal)

Test 2.2 (Beta=0.4)

Randomized Hill Climbing (p=30)

SCEC parameters PS=200, n=30

Learning Rate (%)  Diabetes  Vehicle  HeartStatlog  Glass   Heart-C  Heart-H  IONOSPHERE
5                  0.2174    0.3532   0.1407        0.2804  0.3399   0.1361   0.1196
15                 0.2227    0.3550   0.1296        0.2407  0.3366   0.1020   0.1150
30                 0.2174    0.3515   0.1148        0.2323  0.3333   0.1259   0.1207
50                 0.2174    0.3320   0.1111        0.2330  0.3333   0.1259   0.1054
65                 0.2214    0.3108   0.1148        0.2323  0.3135   0.1190   0.0957
80                 0.2083    0.3092   0.1148        0.2196  0.3300   0.1361   0.1082
90                 0.2057    0.3108   0.1296        0.2349  0.3201   0.1088   0.0872


Different Forms of Clustering

Objective of Supervised Clustering: Minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).


A Fitness Function for Supervised Clustering

q(X) := Impurity(X) + β*Penalty(k)

where

$Impurity(X) = \frac{\#\ \text{of Minority Examples}}{n}$ and $Penalty(k) = \begin{cases} \sqrt{\frac{k-c}{n}} & k \ge c \\ 0 & k < c \end{cases}$

k: number of clusters used

n: number of examples in the dataset

c: number of classes in the dataset

β: weight for Penalty(k), 0 < β ≤ 2.0

[Figure: Penalty(k) vs. k, with k on the horizontal axis and Penalty(k), ranging from 0 to 0.4, on the vertical axis]

Penalty(k) increases sub-linearly, because increasing the number of clusters from k to k+1 has a greater effect on the end result when k is small than when it is large; hence the formula above.
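A Python sketch of the fitness function is shown here, assuming Impurity(X) is the number of minority examples divided by n and Penalty(k) = sqrt((k − c)/n) for k ≥ c and 0 otherwise. The function names and the toy clustering are hypothetical; β = 0.4 is the value listed for Test 2.2.

```python
import math
from collections import Counter


def impurity(cluster_ids, classes):
    """Fraction of examples that do not belong to their cluster's majority class."""
    per_cluster = {}
    for cid, c in zip(cluster_ids, classes):
        per_cluster.setdefault(cid, Counter())[c] += 1
    minority = sum(
        sum(counts.values()) - max(counts.values())
        for counts in per_cluster.values()
    )
    return minority / len(classes)


def penalty(k, c, n):
    """Sub-linear penalty for using more clusters than there are classes."""
    return math.sqrt((k - c) / n) if k >= c else 0.0


def q(cluster_ids, classes, beta=0.4):
    """Fitness q(X) = Impurity(X) + beta * Penalty(k); lower values are better."""
    k = len(set(cluster_ids))
    c = len(set(classes))
    n = len(classes)
    return impurity(cluster_ids, classes) + beta * penalty(k, c, n)


# Toy clustering: 8 examples, 2 classes, 3 clusters, one minority example.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]
classes = ["a", "a", "b", "b", "b", "b", "a", "a"]
score = q(cluster_ids, classes)
print(round(score, 4))
```

For the toy clustering, Impurity(X) = 1/8 and Penalty(3) = sqrt(1/8), so q(X) is roughly 0.125 + 0.4 * 0.3536.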
