Data Clustering

DATA: Data is any raw or unorganized information.

CLUSTER: A cluster is a group of objects that belong to the same class. In a database, a cluster is a set of tables that share common columns and are physically stored together as one table.

Data Clustering

Data clustering is a technique in which logically similar information is physically stored together.

Clustering is “the process of organizing objects into groups whose members are similar in some way”.

In clustering, objects with similar properties are placed in one class of objects (e.g. Nic, library).


Why clustering?

A few good reasons ...

Simplification (e.g. a library)

Pattern detection (e.g. Facebook images)

Useful in data concept construction

An unsupervised learning process

A procedure that identifies groups in the data

Where do we use data clustering?

Data Mining

Pattern Recognition

Speech Recognition

Text Mining

Web Analysis

Marketing

Medical Diagnostics

Image Processing

Applications of Data Clustering

A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity.

The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
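As a rough illustration of intra-class versus inter-class similarity, the sketch below compares two labelings of the same points. It assumes Euclidean distance as an inverse proxy for similarity (small distance means high similarity); the function name `mean_distances` and the sample points are hypothetical and not part of the original slides.

```python
import math
from itertools import combinations

def mean_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        # Pairs with the same label count as intra-cluster, others as inter-cluster.
        (intra if lp == lq else inter).append(math.dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical sample: two visually separate groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

good_intra, good_inter = mean_distances(points, [0, 0, 0, 1, 1, 1])
poor_intra, poor_inter = mean_distances(points, [0, 1, 0, 1, 0, 1])

print("good labeling:", good_intra, good_inter)
print("poor labeling:", poor_intra, poor_inter)
```

Under this assumption, a better labeling shows a smaller mean intra-cluster distance and a larger mean inter-cluster distance than a worse one.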

What Is Good Clustering?

Data mining is the process of discovering information in large amounts of data using pattern recognition technologies and mathematical techniques.

Data mining is widely used in many domains, such as retail, finance, telecommunications, and social media.

Data Clustering in Data Mining

(The analysis step of the "Knowledge Discovery in Databases" process, or KDD)

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Major Clustering Approaches

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters.

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.

Heuristic methods: k-means and k-medoids algorithms

k-means (MacQueen’67): Each cluster is represented by the center of the cluster.

k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster.
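The difference between the two representatives can be sketched as follows. This is only an illustration, assuming Euclidean distance and a hypothetical four-point cluster with one outlier; `centroid` and `medoid` are illustrative helper names.

```python
import math

def centroid(cluster):
    """Mean point of the cluster; it need not be one of the objects."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def medoid(cluster):
    """The object in the cluster with the smallest total distance to the others."""
    return min(cluster, key=lambda c: sum(math.dist(c, p) for p in cluster))

# Hypothetical cluster: three nearby points and one far outlier.
cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0), (10.0, 10.0)]

center = centroid(cluster)          # pulled toward the outlier
representative = medoid(cluster)    # always an actual member of the cluster

print(center, representative)
```

The centroid is dragged toward the outlier, while the medoid stays on a real object, which is one reason k-medoids is often described as less sensitive to outliers than k-means.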

Partitioning Methods

Given k, the k-means algorithm is implemented in 4 steps:

1. Partition objects into k nonempty subsets.

2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.

3. Assign each object to the cluster with the nearest seed point.

4. Go back to Step 2; stop when no assignment changes.
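The four steps can be sketched in plain Python. This is a minimal sketch, not a production implementation: it assumes Euclidean distance, a round-robin initial partition, and a `max_iter` safety cap, none of which come from the slides. The sample points are hypothetical.

```python
import math

def centroid(cluster):
    """Mean point of a non-empty list of equally sized tuples."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def k_means(points, k, max_iter=100):
    # Step 1: partition the objects into k nonempty subsets.
    # Assumption: a simple round-robin split (requires len(points) >= k).
    assignment = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster in the current partition.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids

# Hypothetical sample: two groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
assignment, centroids = k_means(points, k=2)
print(assignment, centroids)
```

On this sample the two groups of three points end up in separate clusters. The result of k-means generally depends on the initial partition, so a different starting split can give a different outcome on less clearly separated data.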

The K-Means Clustering Method

[Figure: four plots, each with both axes running from 0 to 10]

The K-Means Clustering Method: Example

Create a hierarchical decomposition of the set of data (or objects) using some criterion.

Hierarchical Clustering

Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

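Agglomerative (AGNES): bottom-up

Divisive (DIANA): top-down

[Figure: hierarchy over the objects a, b, c, d, e. Read bottom-up (agglomerative), a and b merge into ab, d and e into de, c and de into cde, and finally ab and cde into abcde. Read top-down (divisive), the same hierarchy is produced by splitting in the reverse order.]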
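The bottom-up (agglomerative) direction can be sketched as below. The sketch assumes single linkage (cluster distance is the smallest distance between any two members) and hypothetical 1-D coordinates chosen so that the merges follow the order ab, de, cde, abcde; the slides specify neither the linkage nor the coordinates. The `stop_at` parameter is an illustrative termination condition.

```python
from itertools import combinations

def linkage_distance(left, right, coords):
    """Single linkage: smallest distance between any member of each cluster."""
    return min(abs(coords[p] - coords[q]) for p in left for q in right)

def agglomerate(coords, stop_at=1):
    """Merge the two closest clusters until `stop_at` clusters remain.

    Returns the label of each merged cluster, in merge order.
    """
    clusters = [frozenset([name]) for name in coords]
    merges = []
    while len(clusters) > stop_at:
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: linkage_distance(pair[0], pair[1], coords),
        )
        merged = left | right
        clusters = [c for c in clusters if c not in (left, right)] + [merged]
        merges.append("".join(sorted(merged)))
    return merges

# Hypothetical 1-D positions for the five objects.
coords = {"a": 0.0, "b": 1.0, "c": 5.5, "d": 8.0, "e": 9.5}
merges = agglomerate(coords)
print(merges)
```

Passing a larger `stop_at` ends the merging early, which is one simple form of the termination condition mentioned above.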

Density-based: based on connectivity and density functions.

Grid-based: based on a multiple-level granularity structure.

Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model.
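To make the density-based idea concrete, here is a compact sketch in the style of DBSCAN, a well-known density-based algorithm that the slides do not name. The parameters `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a dense point), the Euclidean distance, and the sample points are all assumptions made for illustration.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or NOISE if it is in no dense region."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # dense point: keep growing the cluster
                queue.extend(reachable)
    return labels

# Hypothetical sample: two dense groups and one isolated point.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (5, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Unlike the partitioning sketch, this approach needs no k; the number of clusters follows from the density parameters, and isolated points are reported as noise.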

Other Algorithms

Scalability: We need highly scalable clustering algorithms to deal with large databases. Scalability is the ability of a system to handle a growing amount of work in a capable manner.

Ability to deal with different kinds of attributes: Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.

High dimensionality: The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.

Ability to deal with noisy data: Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.

Interpretability: The clustering results should be interpretable, comprehensible, and usable.

Requirements of Clustering in Data Mining

Conclusion

In this presentation, I tried to give the basic concept of clustering by first defining clustering and some related terms, with examples to illustrate the concept. I then presented different approaches to data clustering and discussed some algorithms that implement those approaches. The partitioning and hierarchical methods of clustering were explained. The applications of clustering were also discussed, with examples such as medical image databases and data mining using data clustering.

Thank You…
