A Clustering Framework for Unbalanced Partitioning and Outlier Filtering on High Dimensional Datasets

Turgay Tugay Bilgin (1) and A. Yilmaz Camurcu (2)

(1) Department of Computer Engineering, Maltepe University

(2) Department of Computer and Control Education, Marmara University

Outline

Introduction
Relationship based clustering approach / framework
Visualization using CLUSION (CLUSter visualizatION)
Problems of the Framework
Graclus partitioning system
Our Proposed Framework
    Using Graclus: to create Micro-partition Space
    Outlier filtering on micro-partition space
    Using Graclus: to cluster ΔP Space
    Visualization of the results using CLUSION graphs
Experiments
Results

Introduction

Mining high dimensional datasets is an important problem for the data mining community.

Well-known problem: the curse of dimensionality.

Graph based methods such as METIS and CHACO perform best on high dimensional space.

However, these methods have two major problems:
    They cannot perform outlier filtering.
    They force clusters to be balanced.

Relationship based Clustering Approach

Strehl A. and Ghosh J. proposed a better approach for mining high dimensional datasets [1].

They focus on similarity space rather than feature space.

A graph partitioning tool, METIS, is used to perform balanced clustering (OPOSSUM).

They also provide a customized matrix visualization tool called CLUSION.

CLUSION is fast and simple, and it can operate on very high dimensional datasets.

Relationship based Clustering Framework

Data Sources → (feature extraction) → Feature Space → (similarity computation) → Similarity Space → (OPOSSUM: Optimal partitioning of Similarity space using Metis) → Cluster Labels
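The similarity computation step of the pipeline can be sketched in a few lines. The slides do not say which similarity measure is used, so cosine similarity is an assumption here, and the function name `cosine_similarity_matrix` and the toy term counts are made up for illustration.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Return the n x n cosine similarity matrix of the rows of X.

    Assumption: cosine similarity stands in for the (unspecified)
    similarity measure of the framework.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # all-zero rows stay zero instead of dividing by 0
    Xn = X / norms
    return Xn @ Xn.T

# Toy feature space: 4 documents described by 3 term counts (made-up numbers).
X = np.array([[2, 0, 1],
              [4, 0, 2],
              [0, 3, 0],
              [0, 1, 0]])
S = cosine_similarity_matrix(X)  # similarity space: one row and column per document
```

Whatever the feature dimension is, the similarity space has one row and one column per sample, which is why the later steps only need S.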

Visualization using CLUSION

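Clusters appear as symmetrical dark squares along the main diagonal.

The similarity matrix S is permuted with an n×n permutation matrix P, built from the λ index, so that points with the same cluster label become contiguous.

Figure: Similarity Matrix + λ index → CLUSION → Cluster Visualization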
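A minimal sketch of the permutation step, assuming the permuted matrix is computed as P S Pᵀ with P built by sorting the cluster labels. The helper name `clusion_permute` and the toy matrix are hypothetical and not part of the CLUSION tool.

```python
import numpy as np

def clusion_permute(S, labels):
    """Reorder S so that points with the same label are contiguous."""
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    order = np.argsort(labels, kind="stable")  # new position i holds point order[i]
    P = np.zeros((n, n))
    P[np.arange(n), order] = 1.0               # n x n permutation matrix
    return P @ S @ P.T, order

# Toy similarity matrix: points 0 and 2 are similar, points 1 and 3 are similar.
S = np.array([[1.0, 0.1, 0.9, 0.2],
              [0.1, 1.0, 0.3, 0.8],
              [0.9, 0.3, 1.0, 0.1],
              [0.2, 0.8, 0.1, 1.0]])
labels = [0, 1, 0, 1]
S_perm, order = clusion_permute(S, labels)
```

After the permutation the two high-similarity pairs sit in 2×2 blocks on the main diagonal, which is what shows up as dark squares when the matrix is drawn as a grayscale image.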

Problems of the Framework

Produces balanced clusters only:

It forces clusters to be of equal size. In some datasets this can be important, because it avoids trivial clusterings, but in most cases it can cause undesired results.

No outlier filtering:

Outliers can reduce the quality and the validity of the clusters, depending on the resolution and distribution of the dataset.

Graclus* partitioning system

Graclus* is a fast kernel based multilevel algorithm which involves coarsening, initial partitioning and refinement phases.

Unlike METIS, it does not force clusters to be of nearly equal size.

It uses a weighted form of the kernel based k-means approach.

The kernel k-means approach is extremely fast and gives high-quality partitions (*).

* Dhillon, I., Guan, Y., Kulis, B.: A Fast Kernel-based Multilevel Algorithm for Graph Clustering. Proceedings of the 11th ACM SIGKDD, Chicago, IL, August 21-24 (2005).
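To make the weighted kernel k-means idea concrete, here is a small batch version working on a precomputed kernel matrix. It is only a sketch of that one ingredient: the coarsening and initial partitioning phases of Graclus are not shown, and the function name, the linear kernel and the toy points are assumptions chosen for illustration.

```python
import numpy as np

def weighted_kernel_kmeans(K, w, labels, n_clusters, max_iter=50):
    """Batch weighted kernel k-means on a precomputed kernel matrix K.

    The squared distance of point i to the weighted mean of cluster c is
        K_ii - 2 * sum_j(w_j K_ij) / s + sum_jl(w_j w_l K_jl) / s^2,
    with j, l running over the members of c and s the sum of their weights.
    """
    K = np.asarray(K, dtype=float)
    w = np.asarray(w, dtype=float)
    labels = np.asarray(labels).copy()
    diag = np.diag(K)
    for _ in range(max_iter):
        dist = np.full((K.shape[0], n_clusters), np.inf)
        for c in range(n_clusters):
            members = labels == c
            wc = w[members]
            s = wc.sum()
            if s == 0:
                continue  # an empty cluster attracts no points
            cross = K[:, members] @ wc / s
            within = wc @ K[np.ix_(members, members)] @ wc / s ** 2
            dist[:, c] = diag - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels

# Toy data: five points on a line, linear kernel, unit weights (all made up).
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
K = np.outer(x, x)
start = np.array([0, 1, 0, 1, 0])
final = weighted_kernel_kmeans(K, np.ones(len(x)), start, n_clusters=2)
```

On this toy input the result is one cluster of three points and one of two: nothing in the update forces the clusters to be of equal size.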

Our Proposed Framework

Three major improvements:

An intermediate space (P): We call it the “micro-partition space”. Graclus is used for creating unbalanced micro-partitions.

Outlier filtering on the P space (resulting in ΔP): Graclus creates micro-partitions of different sizes. The singletons in the P space are points that do not have enough neighbors; they can be filtered out or marked as outliers.

Using Graclus for clustering the ΔP space: Graclus has two important roles in our framework. The first role is creating the micro-partition space. The second role is unbalanced clustering of the filtered space ΔP, which is denoted by Φ.

Our Proposed Framework

Figure: creating micro-partitions (using Graclus) → Micro-partition space (P), which contains unbalanced tiny partitions → outlier filtering and (re)clustering (using Graclus) → ΔP space

Using Graclus: to create Micro-partition Space

Use Graclus in Similarity Space to create tiny partitions (micro-partitions).

Notation: n = number of samples, k = number of micro-partitions on P space.

The relation between k and p should be: [1]

Micro-partitions can contain up to 4 objects, therefore: [2]

Figure: Outlier filtering on micro-partition space (illustration)

Outlier filtering on micro-partition space

The outliers in P space (Po) are selected using To, the outlier threshold value. The ΔP space is then what remains of P after these outliers are filtered out.
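One possible reading of the filtering step, as a sketch: a point is treated as an outlier when its micro-partition contains at most To points, so To = 1 removes exactly the singletons. The comparison `<=`, the function name `filter_outliers` and the toy label vector are assumptions; the exact formula for Po is not reproduced here.

```python
import numpy as np

def filter_outliers(lambda1, t_o=1):
    """Split point indices into kept points (ΔP) and outliers (Po).

    lambda1 : micro-partition label of every point (first Graclus run).
    t_o     : outlier threshold; partitions with <= t_o points are dropped
              (assumed rule, with t_o = 1 this removes singletons only).
    """
    lambda1 = np.asarray(lambda1)
    _, inverse, counts = np.unique(lambda1, return_inverse=True, return_counts=True)
    partition_size = counts[inverse.reshape(-1)]  # size of each point's partition
    is_outlier = partition_size <= t_o
    return np.flatnonzero(~is_outlier), np.flatnonzero(is_outlier)

# Toy micro-partition labels for 8 points: partitions 1 and 3 are singletons.
lambda1 = [0, 0, 1, 2, 2, 2, 3, 0]
kept, outliers = filter_outliers(lambda1, t_o=1)
```

Given a similarity matrix S over all points, the filtered similarity matrix for the second Graclus run would then be `S[np.ix_(kept, kept)]`.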

Using Graclus: to cluster ΔP Space

Graclus needs the number of partitions k. In formula [1], k refers to the number of micro-partitions. Here k refers to the number of clusters we desire. We denote the former by k1 and the latter by k2.

Graclus performs clustering on the ΔP space and produces the λ index (the cluster labels).

Visualization of the results using CLUSION graphs

CLUSION looks at the λ index, reorders the ΔP space so that points with the same cluster label are contiguous, and then visualizes the resulting permuted ΔP′.

Two λ indices are produced during the clustering process:
    λ1 is created while forming micro-partitions.
    λ2 is created while clustering the ΔP space.

We use λ2 for CLUSION; λ1 is only used for forming micro-partitions.

Experiments: Datasets

We evaluated our proposed framework on two different real-world datasets.

1. 9636 terms from 2225 complete news articles from the BBC News web site (2225-dimensional dataset, 5 natural clusters).

2. A collection of news articles from the Turkish newspaper “Milliyet”, containing 6223 Turkish terms from 1455 news articles (1455-dimensional dataset, 3 natural clusters).

Experiments: Evaluated Frameworks

OPOSSUM: Strehl & Ghosh’s original METIS based framework.

S&G(Graclus): We replaced METIS with Graclus in Strehl & Ghosh’s framework to test the quality of the clusters produced by the Graclus algorithm.

P space+Graclus: Our proposed framework.

Experiments: Comparison Criteria

Purity
Entropy
Mutual Information
CLUSION graphics (visual identification, visual data mining)
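The three numeric criteria can be computed from a cluster-by-class contingency table. The sketch below uses common textbook definitions (unnormalized, base-2 logarithms); the slides do not state which variants or normalizations were used in the experiments, so treat these as assumptions. All function names and the toy labels are illustrative.

```python
import numpy as np

def contingency(true_labels, cluster_labels):
    """Rows = clusters, columns = true classes, entries = point counts."""
    _, t = np.unique(true_labels, return_inverse=True)
    _, c = np.unique(cluster_labels, return_inverse=True)
    M = np.zeros((c.max() + 1, t.max() + 1))
    np.add.at(M, (c, t), 1)
    return M

def purity(M):
    """Fraction of points that belong to the majority class of their cluster."""
    return float(M.max(axis=1).sum() / M.sum())

def entropy(M):
    """Size-weighted average of the class entropy inside each cluster (bits)."""
    sizes = M.sum(axis=1, keepdims=True)
    p = M / sizes
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log2(p), 0.0)
    return float((sizes.ravel() / M.sum() * -plogp.sum(axis=1)).sum())

def mutual_information(M):
    """Mutual information between cluster labels and class labels (bits)."""
    pxy = M / M.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(pxy > 0, pxy * np.log2(pxy / (px * py)), 0.0)
    return float(terms.sum())

# Toy example: 6 points, 2 true classes, one point placed in the wrong cluster.
true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [0, 0, 1, 1, 1, 1]
M = contingency(true_labels, cluster_labels)
pur, ent, mi = purity(M), entropy(M), mutual_information(M)
```

Higher purity and mutual information indicate better agreement with the natural clusters, while lower entropy is better.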

Results: BBC Dataset

Results: BBC Dataset, OPOSSUM

Results: BBC Dataset, S&G(Graclus)

Results: BBC Dataset, P space+Graclus

Results: Milliyet Dataset

Results: Milliyet Dataset, OPOSSUM

Results: Milliyet Dataset, S&G(Graclus)

Results: Milliyet Dataset, P space+Graclus

Thank You!

Presenter: T. Tugay Bilgin
