A Clustering Framework for Unbalanced Partitioning and Outlier Filtering on High Dimensional Datasets

Turgay Tugay Bilgin (1) and A. Yilmaz Camurcu (2)

(1) Department of Computer Engineering, Maltepe University, [email protected]
(2) Department of Computer and Control Education, Marmara University, [email protected]




Outline

Introduction
Relationship based clustering approach / framework
Visualization using CLUSION (CLUSter visualizatION)
Problems of the Framework
Graclus partitioning system
Our Proposed Framework
    Using Graclus: to create Micro-partition Space
    Outlier filtering on micro-partition space
    Using Graclus: to cluster ΔP Space
    Visualization of the results using CLUSION graphs
Experiments
Results


Introduction

Mining high dimensional datasets is an important problem for the data mining community.

Well-known problem: the curse of dimensionality.

Graph based methods such as METIS and CHACO perform best in high dimensional spaces.

However, these methods have two major problems:
    They cannot perform outlier filtering.
    They force clusters to be balanced.


Relationship based Clustering Approach

Strehl A. and Ghosh J. proposed a better approach for mining high dimensional datasets [1].

They focus on similarity space rather than feature space.

A graph partitioning tool, METIS, is used to perform balanced clustering (OPOSSUM).

They also provide a customized matrix visualization tool called CLUSION.

CLUSION is fast and simple, and it can operate on very high dimensional datasets.


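Relationship based Clustering Framework

Data Sources -> Feature Space -> Similarity Space -> Cluster Labels

    Data Sources to Feature Space: feature extraction
    Feature Space to Similarity Space: similarity computation
    Similarity Space to Cluster Labels: OPOSSUM (Optimal Partitioning of Sparse Similarities Using Metis)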
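The similarity computation step can be sketched as follows. The slides do not say which similarity measure is used, so cosine similarity is an assumption here, and `cosine_similarity_matrix` and the toy `features` array are hypothetical names chosen for illustration.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Return the n x n cosine similarity matrix for the rows of X.

    Assumption: cosine similarity stands in for the unspecified
    similarity measure of the framework.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # keep all-zero rows as zero vectors
    unit = X / norms
    return unit @ unit.T

# Three samples in a 3-dimensional feature space (toy data).
features = np.array([[1.0, 0.0, 1.0],
                     [2.0, 0.0, 2.0],
                     [0.0, 3.0, 0.0]])
S = cosine_similarity_matrix(features)
```

The result is an n x n matrix, so everything downstream works on pairwise relationships between samples rather than on the original feature dimensions.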


Visualization using CLUSION

Similarity Matrix + λ index -> CLUSION -> Cluster Visualization

S is permuted with an n x n permutation matrix P.

Clusters appear as symmetrical dark squares along the main diagonal.
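A minimal sketch of the permutation step, assuming λ is given as a plain list of cluster labels and that the permuted matrix is formed as P S Pᵀ. The function name `clusion_permute` and the toy matrix are hypothetical; this is not the CLUSION tool itself.

```python
import numpy as np

def clusion_permute(S, labels):
    """Reorder S so that samples with the same label are contiguous."""
    labels = np.asarray(labels)
    order = np.argsort(labels, kind="stable")  # new position -> original index
    n = len(labels)
    P = np.zeros((n, n))
    P[np.arange(n), order] = 1.0               # row i of P selects sample order[i]
    S_perm = P @ S @ P.T                       # S_perm[i, j] = S[order[i], order[j]]
    return S_perm, order

S = np.array([[1.0, 0.1, 0.9, 0.2],
              [0.1, 1.0, 0.2, 0.8],
              [0.9, 0.2, 1.0, 0.1],
              [0.2, 0.8, 0.1, 1.0]])
labels = [0, 1, 0, 1]
S_perm, order = clusion_permute(S, labels)
```

After the permutation, the high similarities (0.9 and 0.8) sit next to the main diagonal, which is what produces the dark squares when the matrix is drawn as an image.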


Problems of the Framework

Produces balanced clusters only:

It forces clusters to be of equal size. In some datasets this can be important, because it avoids trivial clusterings, but in most cases it can cause undesired results.

No outlier filtering:

Outliers can reduce the quality and the validity of the clusters depending on the resolution and distribution of the dataset.


Graclus* partitioning system

Graclus* is a fast kernel based multilevel algorithm which involves coarsening, initial partitioning and refinement phases.

Unlike METIS, it does not force clusters to be of nearly equal size.

It uses a weighted form of the kernel based k-means approach.

The kernel k-means approach is extremely fast and gives high-quality partitions (*).

* Dhillon, I., Guan, Y., Kulis, B.: A Fast Kernel-based Multilevel Algorithm for Graph Clustering, Proceedings of the 11th ACM SIGKDD, Chicago, IL, August 21-24 (2005).
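To make the weighted kernel k-means idea concrete, here is a small flat (single-level) sketch. It is not Graclus and omits the coarsening and refinement phases; the function name, the toy kernel matrix, and the initial labels are all hypothetical. The distance from a point to a cluster centroid is computed purely from kernel entries, as in the weighted kernel k-means formulation.

```python
import numpy as np

def weighted_kernel_kmeans(K, weights, init_labels, n_clusters, max_iter=20):
    """Flat weighted kernel k-means on a kernel (similarity) matrix K."""
    K = np.asarray(K, dtype=float)
    w = np.asarray(weights, dtype=float)
    labels = np.asarray(init_labels).copy()
    n = K.shape[0]
    for _ in range(max_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            members = labels == c
            s = w[members].sum()
            if s == 0:
                continue  # an empty cluster cannot attract points
            wc = w[members]
            # ||phi(x_i) - m_c||^2 = K_ii - 2*sum_j w_j K_ij / s
            #                        + sum_{j,l} w_j w_l K_jl / s^2
            cross = K[:, members] @ wc / s
            within = wc @ K[np.ix_(members, members)] @ wc / s ** 2
            dist[:, c] = np.diag(K) - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Two obvious groups: samples {0, 1} and samples {2, 3}.
K = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.9, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.9],
              [0.0, 0.0, 0.9, 1.0]])
result = weighted_kernel_kmeans(K, np.ones(4), [0, 0, 0, 1], n_clusters=2)
```

Nothing in the update constrains cluster sizes, which is why this family of methods can produce unbalanced partitions. A flat run like this one can stall in a poor local optimum for an unlucky initialization, which is the problem the multilevel scheme is meant to reduce.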


Our Proposed Framework

Three major improvements:

An intermediate space (P): We call it the “micro-partition space”. Graclus is used for creating unbalanced micro-partitions.

Outlier filtering on the P space (resulting in ΔP): Graclus creates micro-partitions of different sizes. The singletons in the P space, that is, the points that do not have enough neighbors, can be filtered or marked as outliers.

Using Graclus for clustering the ΔP space: Graclus has two important roles in our framework. The first role is creating the micro-partition space. The second role is unbalanced clustering of the filtered space ΔP, which is denoted by Φ.


Our Proposed Framework

Creating micro-partitions (using Graclus)
    results in the micro-partition space (P), which contains unbalanced tiny partitions

Outlier filtering and (re)clustering (using Graclus)
    results in the ΔP space


Using Graclus: to create Micro-partition Space

Use Graclus in similarity space to create tiny partitions (micro-partitions).

Notation: n = number of samples, k = number of micro-partitions in the P space.

The relation between k and p should be: [1]

Micro-partitions can contain up to 4 objects, therefore: [2]


Outlier filtering on micro-partition space illustration


Outlier filtering on micro-partition space

The outliers in the P space (Po) are:

where To is the outlier threshold value.

Then, the ΔP space is:
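A minimal sketch of this filtering step under one plausible reading: a micro-partition counts as an outlier when its size is at most To, and ΔP is what remains after removing those micro-partitions. The comparison (`<=`), the default threshold of 1 (singletons only), and the names `filter_outliers` and `micro_labels` are assumptions made for illustration, not the authors' definitions.

```python
from collections import Counter

def filter_outliers(micro_labels, outlier_threshold=1):
    """Split sample indices into kept samples (ΔP) and outliers (Po).

    micro_labels[i] is the micro-partition id of sample i.
    Assumption: a micro-partition with size <= outlier_threshold is an outlier.
    """
    sizes = Counter(micro_labels)
    outlier_parts = {p for p, size in sizes.items() if size <= outlier_threshold}
    kept = [i for i, p in enumerate(micro_labels) if p not in outlier_parts]
    outliers = [i for i, p in enumerate(micro_labels) if p in outlier_parts]
    return kept, outliers

# Micro-partitions 1 and 3 are singletons, so samples 2 and 6 are filtered.
micro_labels = [0, 0, 1, 2, 2, 2, 3, 0]
kept, outliers = filter_outliers(micro_labels, outlier_threshold=1)
```

The kept indices can then be used to select the matching rows and columns of the similarity matrix before the second clustering pass.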


Using Graclus: to cluster ΔP Space

Graclus needs the number of partitions k.

In formula [1], k refers to the number of micro-partitions. Here, k refers to the number of clusters we desire. We denote the former by k1 and the latter by k2.

Graclus performs clustering on the ΔP space and produces the λ index, which is defined as:


Visualization of the results using CLUSION graphs

CLUSION looks at λ, reorders the ΔP space so that points with the same cluster label are contiguous, and then visualizes the resulting permuted ΔP′.

There are two λ indices produced during the clustering process:
    λ1 is created while forming micro-partitions.
    λ2 is created while clustering the ΔP space.

We use λ2 for CLUSION; the first one is only used for forming micro-partitions.


Experiments: Datasets

We evaluated our proposed framework on two different real world datasets.

1. 9636 terms from 2225 complete news articles from the BBC News web site (2225 dimensional dataset, 5 natural clusters).

2. A collection of news articles from the Turkish newspaper “Milliyet”, containing 6223 terms in Turkish from 1455 news articles (1455 dimensional dataset, 3 natural clusters).


Experiments: Evaluated Frameworks

OPOSSUM: Strehl & Ghosh’s original METIS based framework.

S&G (Graclus): We replaced METIS with Graclus in Strehl & Ghosh’s framework to test the quality of the clusters produced by the Graclus algorithm.

P space + Graclus: Our proposed framework.


Experiments: Comparison Criteria

Purity
Entropy
Mutual Information
CLUSION graphics (visual identification, visual data mining)
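The three numeric criteria can be sketched from their common textbook definitions. The slides do not give the exact formulas used in the experiments (for example, whether mutual information is normalized), so the unnormalized forms in bits below are an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter

def purity(clusters, classes):
    """Fraction of samples that belong to the majority class of their cluster."""
    counts = Counter(zip(clusters, classes))
    best = Counter()
    for (c, _), m in counts.items():
        best[c] = max(best[c], m)
    return sum(best.values()) / len(clusters)

def entropy(clusters, classes):
    """Size-weighted average class entropy of the clusters, in bits."""
    n = len(clusters)
    counts = Counter(zip(clusters, classes))
    sizes = Counter(clusters)
    total = 0.0
    for (c, _), m in counts.items():
        p = m / sizes[c]
        total -= (sizes[c] / n) * p * math.log2(p)
    return total

def mutual_information(clusters, classes):
    """Unnormalized mutual information between cluster and class labels, in bits."""
    n = len(clusters)
    counts = Counter(zip(clusters, classes))
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    mi = 0.0
    for (c, k), m in counts.items():
        mi += (m / n) * math.log2(m * n / (cluster_sizes[c] * class_sizes[k]))
    return mi

clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
p = purity(clusters, classes)
h = entropy(clusters, classes)
mi = mutual_information(clusters, classes)
```

Higher purity and mutual information indicate better agreement with the natural clusters, while lower entropy is better.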


Results: BBC Dataset


Results: BBC Dataset, OPOSSUM


Results: BBC Dataset, S&G (Graclus)


Results: BBC Dataset, P space + Graclus


Results: Milliyet Dataset


Results: Milliyet Dataset, OPOSSUM


Results: Milliyet Dataset, S&G (Graclus)


Results: Milliyet Dataset, P space + Graclus


Thank You!

Presenter: T. Tugay Bilgin