Data clustering


Upload: salman-memon

Post on 13-Jan-2017


TRANSCRIPT

Page 1: Data clustering

DATA CLUSTERING

Page 2: Data clustering

Data Clustering

DATA: Data is any raw material or unorganized information.

CLUSTER: A cluster is a group of objects that belong to the same class. In a database, a cluster is a set of tables that share common columns and are physically stored together as one table.

Page 3: Data clustering

DATA CLUSTERING

Data clustering is a technique in which information that is logically similar is physically stored together.

Clustering is "the process of organizing objects into groups whose members are similar in some way".

In clustering, objects with similar properties are placed in one class of objects (e.g. NIC, lib).

Page 4: Data clustering
Page 5: Data clustering

Why clustering?

A few good reasons ...

Simplification (e.g. library)
Pattern detection (e.g. Facebook images)
Useful in data concept construction
Unsupervised learning process

Clustering is a procedure that identifies groups in the data.

Page 6: Data clustering

Applications of Data Clustering

Where do we use data clustering?

Data mining
Pattern recognition
Speech recognition
Text mining
Web analysis
Marketing
Medical diagnostics
Image processing

Page 7: Data clustering

What Is Good Clustering?

A good clustering method will produce high-quality clusters with:
high intra-class similarity
low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
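As a rough illustration of the two criteria above, the sketch below compares the average distance between points in the same cluster with the average distance between points in different clusters. The function names, the sample points and the use of Euclidean distance are assumptions made for this example and do not come from the slides; a small distance stands in for high similarity.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)


def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    ]
    return sum(pairs) / len(pairs)


# Hypothetical 2-D data: two tight, well-separated groups.
good = [[(1, 1), (1, 2), (2, 1)], [(8, 8), (8, 9), (9, 8)]]
# The same six points, split badly across the two groups.
bad = [[(1, 1), (8, 8), (2, 1)], [(1, 2), (8, 9), (9, 8)]]

for name, clusters in (("good", good), ("bad", bad)):
    intra = mean_intra_distance(clusters)
    inter = mean_inter_distance(clusters)
    print(f"{name}: intra={intra:.2f} inter={inter:.2f}")
```

For the well-separated grouping the intra-cluster average is much smaller than the inter-cluster average; for the bad grouping the two values move toward each other.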

Page 8: Data clustering

Good Clustering

Page 9: Data clustering

Data Clustering in Data Mining

Data mining is the process of discovering information in large amounts of data, using pattern recognition technologies and mathematical techniques. (It is the analysis step of the "Knowledge Discovery in Databases" process, or KDD.)

Data mining is widely used in many domains, such as retail, finance, telecommunications and social media.

Page 10: Data clustering

Major Clustering Approaches

Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods

Page 11: Data clustering

Partitioning Methods

Partitioning method: construct a partition of a database D of n objects into a set of k clusters.

Given k, find a partition into k clusters that optimizes the chosen partitioning criterion.

Heuristic methods: the k-means and k-medoids algorithms.

k-means (MacQueen '67): each cluster is represented by the center of the cluster.

k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster.
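The difference between the two cluster representatives can be shown with a small sketch. The helper names and the sample cluster are assumptions for illustration only; Euclidean distance is assumed as the dissimilarity measure.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Mean point of the cluster; it need not be one of the objects."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """Object of the cluster with the smallest total distance to the others."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


# Hypothetical cluster with one outlier at (10, 10).
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (10, 10)]

print("centroid:", centroid(cluster))
print("medoid:  ", medoid(cluster))
```

Here the centroid is pulled toward the outlier and is not one of the five objects, while the medoid is always an actual object of the cluster, in this case (2, 2).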

Page 12: Data clustering

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:

1. Partition the objects into k nonempty subsets.

2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.

3. Assign each object to the cluster with the nearest seed point.

4. Go back to Step 2; stop when no assignment changes.
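The four steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the function names, the random round-robin initial partition, the `seed` and `max_iter` parameters, the Euclidean distance and the handling of an emptied cluster are all choices made for this example.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def mean_point(points):
    """Centroid (mean point) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) following the four steps above."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Step 1: partition the objects into k non-empty subsets
    # (shuffle the indices, then deal them out round-robin).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k

    centroids = []
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster of the current partition.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Assumption: a cluster that became empty keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[c])
        centroids = new_centroids
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop when no assignment changed, otherwise repeat from step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical data: two well-separated groups of three points each.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 and which receives label 1 depends on the random initial partition, so only the grouping itself is meaningful.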

Page 13: Data clustering

The K-Means Clustering Method: Example

[Figure: four scatter-plot panels, each with both axes running from 0 to 10.]

Page 14: Data clustering

Hierarchical Clustering

Create a hierarchical decomposition of the set of data (or objects) using some criterion.

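Page 15: Data clustering

Hierarchical Clustering

Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

Agglomerative (AGNES): bottom-up

Divisive (DIANA): top-down

[Figure: objects a, b, c, d and e grouped into the clusters ab, de, cde and abcde. The agglomerative direction builds these merges bottom-up; the divisive direction produces the splits top-down.]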
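A bottom-up (agglomerative) pass with a termination condition can be sketched as below. This is an illustrative sketch, not the exact AGNES procedure: the slides do not say how the distance between two clusters is measured, so single linkage (closest pair of points) is assumed, and the function names, the `max_distance` threshold and the coordinates chosen for a to e are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_link(a, b):
    """Distance between two clusters: their closest pair of points."""
    return min(dist(p, q) for p in a for q in b)


def agglomerative(points, max_distance):
    """Merge the two closest clusters until they are farther apart than max_distance."""
    clusters = [[p] for p in points]  # start: every object is its own cluster
    merges = []                       # record of (distance, merged cluster)
    while len(clusters) > 1:
        # Find the closest pair of clusters (indices i < j).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        d = single_link(clusters[i], clusters[j])
        if d > max_distance:          # termination condition
            break
        merged = clusters[i] + clusters[j]
        merges.append((d, merged))
        # Remove j first so that index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters, merges


# Hypothetical coordinates for the five objects a, b, c, d, e.
a, b, c, d, e = (0, 0), (1, 0), (6, 0), (9, 0), (10.5, 0)
clusters, merges = agglomerative([a, b, c, d, e], max_distance=4)
for distance, merged in merges:
    print(distance, merged)
print(clusters)
```

With these coordinates the merges happen in the order ab, de, cde, and the threshold of 4 stops the process before ab and cde are joined into abcde.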

Page 16: Data clustering

Other Algorithms

Density-based: based on connectivity and density functions.

Grid-based: based on a multiple-level granularity structure.

Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model.
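To make the density-based idea more concrete, here is a small sketch in the style of DBSCAN, a well-known density-based algorithm that the slides do not name. The function name, the parameters `eps` (neighbourhood radius) and `min_pts` (minimum neighbourhood size), and the sample points are assumptions for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def density_cluster(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)    # None = not visited yet
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:     # not dense enough: provisional noise
            labels[i] = -1
            continue
        cluster_id += 1              # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:      # former noise becomes a border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue             # already assigned, do not expand again
            labels[j] = cluster_id
            nearby = neighbours(j)
            if len(nearby) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(nearby)
    return labels


# Hypothetical data: two dense groups and one isolated point.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (5, 15)]
print(density_cluster(points, eps=1.5, min_pts=3))
```

Unlike k-means, this approach does not take the number of clusters as input, and points in sparse regions are reported as noise instead of being forced into a cluster.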

Page 17: Data clustering

Requirements of Clustering in Data Mining

Scalability: We need highly scalable clustering algorithms to deal with large databases. Scalability is the ability of a system to handle a growing amount of work in a capable manner.

Ability to deal with different kinds of attributes: Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical and binary data.

High dimensionality: The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.

Ability to deal with noisy data: Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.

Interpretability: The clustering results should be interpretable, comprehensible and usable.

Page 18: Data clustring

Conclusion

In this presentation, I tried to give the basic concept of clustering by first providing the definition of clustering and then the definitions of some related terms. I gave some examples to elaborate the concept. Then I presented different approaches to data clustering and discussed some algorithms that implement those approaches. The partitioning and hierarchical methods of clustering were explained. The applications of clustering were also discussed, with the examples of medical image databases and data mining using data clustering.

Page 19: Data clustering
Page 20: Data clustering

Thank You…