
Intelligent Database Systems Lab

National Yunlin University of Science and Technology

RIC: Parameter-Free Noise-Robust Clustering

Presenter: Shu-Ya Li

Authors: Christian Böhm, Christos Faloutsos, Jia-Yu Pan, Claudia Plant

TKDD, 2007

Outline

Motivation

Objective

Methodology

Experiments and Results

Conclusion

Personal Comments

Motivation

How can we find a natural clustering of a real-world point set that contains an unknown number of clusters with different shapes, where the clusters may be contaminated by noise?

Objectives

Find a natural clustering in a dataset.

Goodness of a clustering: Volume after Compression (VAC) quantifies the "goodness" of a grouping.

Efficient algorithm for a good clustering: Robust Fitting and Cluster Merging.

MDL for classification; VAC for clustering.

VAC (Volume after Compression)

VAC tells us which grouping is better: a lower VAC means a better grouping.

The VAC formula uses the decorrelation matrix.

Computing VAC: compute the covariance matrix of cluster C, compute the PCA to obtain the decorrelation matrix, then compute VAC from that matrix.
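To make these steps concrete, here is a minimal Python sketch of a VAC-style coding cost. It assumes a Gaussian coding model and a fixed coordinate resolution; the function name vac_cost, the bit widths, and the grid value are illustrative assumptions, not the paper's exact formula.

```python
import numpy as np

def vac_cost(points, grid=1e-6, param_bits=32):
    """Illustrative VAC-style coding cost for one cluster.

    Assumptions (not the paper's exact formula): a Gaussian coding model,
    coordinates quantized to a fixed grid, and a flat `param_bits` cost
    per model parameter.
    """
    X = np.asarray(points, dtype=float)
    n, d = X.shape

    # 1. Covariance matrix of cluster C
    cov = np.cov(X, rowvar=False)

    # 2. PCA: the eigenvectors form the decorrelation matrix V
    _, V = np.linalg.eigh(cov)
    Y = (X - X.mean(axis=0)) @ V               # decorrelated coordinates
    var = np.maximum(Y.var(axis=0, ddof=1), grid ** 2)

    # 3. Coding cost: Gaussian code length per decorrelated axis per point,
    #    plus the cost of the model parameters (mean and variances).
    bits_per_point = np.sum(0.5 * np.log2(2 * np.pi * np.e * var) - np.log2(grid))
    return n * bits_per_point + 2 * d * param_bits

# Tighter clusters compress better, i.e. they get a lower VAC.
rng = np.random.default_rng(0)
tight = rng.normal(0.0, 0.1, size=(200, 2))
loose = rng.normal(0.0, 5.0, size=(200, 2))
print(vac_cost(tight) < vac_cost(loose))       # True
```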

Computing VAC

VAC (volume after compression) counts the bytes needed to record each cluster's distribution type (Gaussian, uniform, ...), the bytes for the number of clusters k, the bytes to describe the parameters of each distribution (e.g., mean, variance, covariance, slope, intercept), and then the location of each point.

Cluster model: stat = (μi, σi, lbi, ubi, ...)

Slide example: 2.3 + 4.3 = 6.6 bits.
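As a companion to the per-cluster cost above, the sketch below tallies the bookkeeping the slide lists (distribution type, the number of clusters k, parameters, then the points). The dictionary layout and the type_bits and param_bits values are assumptions for illustration only.

```python
import math

def total_vac(clusters, type_bits=2, param_bits=32):
    """Illustrative total-VAC bookkeeping (field names and bit widths are
    assumptions, not the paper's exact encoding)."""
    k = len(clusters)
    bits = math.ceil(math.log2(k + 1))        # cost of recording the number of clusters k
    for c in clusters:
        bits += type_bits                      # distribution type (Gaussian, uniform, ...)
        bits += c["n_params"] * param_bits     # parameters (mean, variance, covariance, ...)
        bits += c["point_bits"]                # location of each point under that model
    return bits

# Example: two clusters whose per-point costs were computed elsewhere (e.g. vac_cost).
print(total_vac([{"n_params": 4, "point_bits": 7500.0},
                 {"n_params": 4, "point_bits": 9800.0}]))
```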

Methodology – RIC framework

Robust Fitting: the Mahalanobis distance is defined by Λ and V from the PCA of the covariance matrix, Σ = V Λ Vᵀ.

Conventional estimation builds the covariance matrix around the mean; robust estimation builds it around the median, since the median is less affected by outliers than the mean.

(Figure: conventional mean μ versus robust center μR, based on the median.)
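The following Python sketch illustrates the robust-fitting idea: center and scale with medians instead of means, decompose the resulting scatter matrix as Σ = V Λ Vᵀ, and score points by their Mahalanobis distance. The MAD-based scatter estimate and the helper name robust_mahalanobis are simplifications assumed here, not the paper's exact estimator.

```python
import numpy as np

def robust_mahalanobis(X):
    """Sketch of robust fitting: median-based center/scatter, then
    Mahalanobis distances via the PCA decomposition Sigma = V Lambda V^T.
    (A simplified stand-in, not the paper's exact estimator.)
    """
    X = np.asarray(X, dtype=float)

    # Robust center: coordinate-wise median instead of the mean
    mu_r = np.median(X, axis=0)
    Z = X - mu_r

    # Robust scatter: MAD on the diagonal scale, conventional correlation
    # structure -- an assumption made for illustration
    mad = np.median(np.abs(Z), axis=0) * 1.4826    # consistent with std for Gaussians
    corr = np.corrcoef(Z, rowvar=False)
    sigma = np.outer(mad, mad) * corr

    # PCA of the scatter matrix: Sigma = V Lambda V^T
    lam, V = np.linalg.eigh(sigma)
    lam = np.maximum(lam, 1e-12)

    # Mahalanobis distance defined by Lambda and V
    Y = Z @ V                                      # decorrelate
    d2 = np.sum((Y ** 2) / lam, axis=1)
    return np.sqrt(d2)

# Points with large distances are treated as outliers / noise.
```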

Methodology – RIC framework

Cluster Merging: merge Ci and Cj only if the combined VAC decreases, i.e., if savedCost(Ci, Cj) = VAC(Ci) + VAC(Cj) − VAC(Ci ∪ Cj) > 0.

A greedy search maximizes savedCost and hence minimizes the overall VAC.
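A minimal greedy merging loop in the same spirit: it repeatedly merges the pair of clusters with the largest positive savedCost until no merge reduces the total coding cost. The vac argument can be any coding-cost function (e.g. the vac_cost sketch above); the loop structure is an assumed illustration, not the paper's exact procedure.

```python
import numpy as np

def greedy_merge(clusters, vac):
    """Greedy cluster merging sketch: merge the pair with the largest
    positive savedCost until no merge lowers the total VAC.
    `clusters` is a list of point arrays; `vac` is any coding-cost function.
    """
    clusters = [np.asarray(c) for c in clusters]
    while True:
        best_gain, best_pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = np.vstack([clusters[i], clusters[j]])
                # savedCost = VAC(Ci) + VAC(Cj) - VAC(Ci u Cj)
                gain = vac(clusters[i]) + vac(clusters[j]) - vac(merged)
                if gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        if best_pair is None:          # no merge saves bits -> stop
            return clusters
        i, j = best_pair
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```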

Experiments

Results on Synthetic Data

Experiments

Performance on Real Data

Experiments

Comparison of the results of filterOpt and filterDist.

Conclusion

The contributions of this work are the answers to two questions, organized in the RIC framework.

(Q1) Goodness measure: the VAC criterion, which uses information-theoretic concepts, specifically the volume after compression.

(Q2) Efficiency: the Robust Fitting (RF) algorithm, which carefully avoids outliers, and the Cluster Merging (CM) algorithm, which stitches clusters together when the stitching gives a better VAC score.

Personal Comments

Advantages: detailed descriptions, with many pictures and examples.

Drawback: the black-and-white figures are difficult to read.

Application: clustering.
