
Clustering Analysis in Data Mining

K. Sivaraman 1, P. Arumugam 2

Assistant Professor 1, 2

Department of CSE, BIST, BIHER, Bharath University, Chennai.

[email protected]

Abstract: The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. Cluster analysis is one of the main analytical methods in data mining, and the choice of clustering algorithm directly influences the clustering result. A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. This paper discusses various types of clustering algorithms, such as k-means, and analyzes their advantages and shortcomings. The distance between each pair of clustered data objects can be calculated. This paper provides a broad survey of the most basic techniques. The results are discussed on large datasets.

Keywords: Clustering, Datasets, Machine-learning, Deterministic

International Journal of Pure and Applied Mathematics, Volume 119, No. 12, 2018, 9639-9649. ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu. Special Issue.

1. Introduction:

Cluster analysis is the automatic identification of groups of similar objects; it organizes a collection of patterns into such groups. Clustering plays an important role in data analysis. It has been used widely and has been an active subject in various research fields such as statistics [1-6], pattern recognition and machine learning. Clustering is an unsupervised learning method that groups data into subgroups, called clusters, based on well-defined measures of similarity between two objects. A variety of clustering approaches have been developed for different goals and applications in specific areas [7-12].

The goal of this paper is to survey the core concepts and techniques in the large subset of cluster analysis with its roots in statistics and decision theory. Where appropriate, references will be made to key concepts and techniques arising from methodology in the machine learning and other communities [13-19].
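A similarity measure between two objects is very often a distance function on their attribute vectors. As a minimal sketch, assuming purely numerical attributes (the function names and the sample points are illustrative and do not come from the survey), two common choices can be written in Python as:

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block (L1) distance between two equal-length numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical objects with two numerical attributes each.
p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

A smaller distance means the two objects are more similar. Boolean or categorical attributes would need a different measure.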

Cluster Definitions:

Clustering is the process of partitioning a set of data or objects into a set of meaningful sub-classes, called clusters.

There is no target to predict: the data has no dependent variable.

The segmentation develops on its own from the values of the input variables.

That is why it is called unsupervised learning.

2. Typical Requirements for Good Clustering Techniques in Data Mining:

Scalability:

The clustering method should be applicable to huge databases, and its performance should degrade gradually, ideally no worse than linearly, as the data size increases.

Versatility:

The objects can be of different types: numerical data, Boolean data or categorical data. The clustering method should be suitable for all of these distinct types of data objects.

Ability to discover clusters with different shapes:

This is an important requirement for spatial data clustering. Many clustering algorithms can only discover clusters with spherical shapes. However, a cluster could be of any shape, so it is important to develop algorithms that can identify clusters of different shapes.

Minimal input parameters:

The clustering results can be quite sensitive to input parameters, and parameters are difficult to determine for datasets containing high-dimensional objects. However, most clustering algorithms have several key parameters, which makes them impractical for use in real-world applications. This not only burdens users but also makes the quality of clustering difficult to control.

Robustness with regard to noise:

This is important because noise exists everywhere. A good clustering algorithm should be able to perform successfully even in the presence of noise [20-26]. Some clustering algorithms are sensitive to noisy data and may produce clusters of poor quality.

Insensitivity to the order of data input:

Some clustering algorithms cannot incorporate newly inserted information (i.e., database updates) into existing clustering structures. The clustering method should give consistent results regardless of the order in which the data is presented. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input [27-33].

High dimensionality:

The ability to handle high dimensionality is very challenging, but real data sets are often multidimensional. Human eyes are good at judging the quality of clusters in up to three dimensions [34-39]. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.

Interpretability and usability:

The clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.

3. Taxonomy of Clustering Techniques:

There exists a large number of clustering algorithms. Generally, these algorithms can be classified into four groups: partitioning methods, hierarchical methods, density-based methods and grid-based methods. In order to examine the clustering ability of clustering algorithms, we performed an experimental evaluation upon k-means.

3.1. Partitioning Methods:

Assume there are n objects in the original dataset. A partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements:

Each group must contain at least one object, and

Each object must belong to exactly one group.

Each cluster is represented by the gravity center of the cluster in the k-means method, or by one of the “central” objects of the cluster in the k-medoid method. Once cluster representatives are selected, data points are assigned to these representatives [40-45].

The general criterion of a good partitioning is that objects in the same cluster are close or related to each other, whereas objects of different clusters are far apart or very different.

All the partitioning methods have a similar clustering quality, and the major difficulties with these methods include:

1. The number k of clusters to be found needs to be known prior to clustering, requiring at least some domain knowledge which is often not available;

2. It is difficult to identify clusters with large variations in sizes (large genuine clusters tend to be split);

3. The method is only suitable for convex, roughly spherical clusters.
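The difference between the two cluster representatives mentioned above can be illustrated with a short Python sketch. It assumes numerical points and Euclidean distance; the helper names `centroid` and `medoid` and the sample cluster are illustrative assumptions, not taken from the paper.

```python
import math

def centroid(points):
    """Mean point of a cluster (the representative used by k-means)."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

def medoid(points):
    """Cluster member with the smallest total distance to all members
    (the "central" object used by k-medoid methods)."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Hypothetical cluster with one far-away object.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (14.0, 0.0)]
print(centroid(cluster))  # (4.0, 0.0)
print(medoid(cluster))    # (2.0, 0.0)
```

In this example the distant object pulls the centroid away from the bulk of the cluster, while the medoid remains an actual member near the center.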

3.2. Hierarchical Clustering:

A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is also useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices.

Hierarchical clustering is usually applied to smaller datasets with fewer observations. At each stage it shows how the observations are linked to one another. PROC CLUSTER is the procedure to use in SAS. The dendrogram and the scree plot are useful for interpreting the result.

Scree plot: the within-cluster variance plotted against the number of clusters.

RMS STD (root-mean-square standard deviation) measures the within-cluster variance.

If there is only one cluster, every observation is within that cluster. When the number of clusters = 1, RMS STD corresponds to the total variance within the data.

The elbow of the plot indicates the optimal number of clusters. At that point the clusters are homogeneous within (because of the lower RMS STD, i.e. the within variance) and heterogeneous across clusters (because of the higher between variance = total variance - within variance).
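The relation between variance = total variance - within variance can be checked numerically. The following Python sketch uses un-normalized sums of squares as the variance measure and one-dimensional data to keep it short; both choices, as well as the function names, are assumptions made for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def sum_sq(values, center):
    """Sum of squared deviations of the values from a center."""
    return sum((v - center) ** 2 for v in values)

def variance_split(clusters):
    """Return (total, within, between) sums of squares for 1-D clusters."""
    all_values = [v for c in clusters for v in c]
    grand_mean = mean(all_values)
    total = sum_sq(all_values, grand_mean)
    within = sum(sum_sq(c, mean(c)) for c in clusters)
    between = sum(len(c) * (mean(c) - grand_mean) ** 2 for c in clusters)
    return total, within, between

data = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
print(variance_split([data]))                # (154.0, 154.0, 0.0)
print(variance_split([data[:3], data[3:]]))  # (154.0, 4.0, 150.0)
```

With a single cluster the within value equals the total. Splitting the data into its two natural groups drops the within value sharply, which is the kind of bend a scree plot shows at the elbow.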



Dendrogram: shows how the clusters are formed at each step.

The dendrogram shows how the data objects are combined with one another; each merge forms one step.
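The merge steps that a dendrogram draws can be listed with a naive agglomerative procedure. The sketch below is only an illustration under stated assumptions: it uses Euclidean distance and the shortest-distance (single linkage) rule to pick the next pair of clusters, and the function name `agglomerate` and the sample points are hypothetical. It is not the SAS procedure mentioned in the text.

```python
import math

def agglomerate(points):
    """Naive agglomerative clustering with single linkage.

    Returns the merge history: one (cluster_a, cluster_b, distance)
    tuple per step, in the order the merges happen.
    """
    clusters = [[p] for p in points]  # every object starts as its own cluster
    history = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (7.0, 0.0)]
for a, b, d in agglomerate(points):
    print(a, "+", b, "at distance", d)
```

For n objects the history contains n - 1 merges. The merge distances are the heights at which a dendrogram would join the corresponding branches.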

Linkage Function: the inter-cluster distance.

Three types of linkage:

Single Linkage - The shortest distance between any two objects where one object is from cluster A and the other is from cluster B.

Average Linkage - The mean similarity of one cluster to another, taken over all pairs of objects with one object from each cluster.

Complete Linkage - Defines the distance between two clusters to be the maximum distance between their individual components.
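The three linkage functions differ only in how they summarize the pairwise distances between the two clusters. A minimal Python sketch, assuming Euclidean distance and reading the average linkage as the mean pairwise distance (the function names and sample clusters are illustrative):

```python
import math
from itertools import product

def pair_distances(cluster_a, cluster_b):
    """All distances between one object of cluster_a and one of cluster_b."""
    return [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pair_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    return max(pair_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    d = pair_distances(cluster_a, cluster_b)
    return sum(d) / len(d)

# Hypothetical clusters A and B.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]
print(single_linkage(cluster_a, cluster_b))    # 3.0
print(average_linkage(cluster_a, cluster_b))   # 4.5
print(complete_linkage(cluster_a, cluster_b))  # 6.0
```

For any pair of clusters the single linkage is the smallest of the three values and the complete linkage the largest, with the average linkage in between.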



K-Means Clustering:

The basic k-means clustering technique is simple, and we begin with a description of the basic algorithm. First we choose k initial centroids, where k is a user-specified parameter. Each point is assigned to the closest centroid, and each collection of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to the cluster. We repeat the assignment and update steps until no point changes clusters, or until the centroids remain the same.

At the beginning we have to decide k, the number of clusters that are finally needed. Given k, the k-means algorithm is implemented in four steps:

1. Partition the objects into k non-empty subsets randomly.

2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., the mean point, of the cluster).

3. Assign each object to the cluster with the nearest seed point.

4. Go back to step 2; stop when the assignment does not change.

PROC FASTCLUS is the SAS procedure used for this.
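The four steps can be sketched in plain Python as follows. This is an illustrative sketch rather than the SAS implementation: Euclidean distance, the `seed` and `max_iter` parameters, the handling of clusters that become empty, and the sample data are all assumptions added here.

```python
import math
import random

def mean_point(points):
    """Centroid (mean point) of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def k_means(points, k, seed=0, max_iter=100):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of objects")
    rng = random.Random(seed)
    # Step 1: partition the objects randomly into k non-empty subsets.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k  # round-robin keeps every subset non-empty
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of each current cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an emptied cluster keeps its old centroid
                centroids[c] = mean_point(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Step 4: stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical data: two well-separated groups of three objects each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

On this small example the first three objects end up in one cluster and the last three in the other. Which cluster receives label 0 depends on the random initial partition.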

Conclusion:

We believe that cluster analysis is an important tool to classify units into groups. Its main advantage is that it produces an objective and replicable classification that can develop our knowledge. This paper provided an intuitive introduction to cluster analysis.

An additional issue related to selecting an algorithm is correctly choosing the initial set of clusters. Also important is that some clustering methods, such as hierarchical clustering, need a distance matrix which contains the distances between every pair of elements in the dataset. Recently this issue has been addressed, resulting in new variations of hierarchical and reciprocal nearest neighbor clustering. This paper provides a broad survey of the most basic techniques.



6 Conclusion

We disconfirmed in this paper that the acclaimed encrypted algorithm for the investigation of multicast applications runs in Ω(log N) time, and our heuristic is no exception to that rule. We also constructed new omniscient symmetries. The characteristics of our method, in relation to those of more famous algorithms, are dubiously more essential. Along these same lines, in fact, the main contribution of our work is that we examined how the transistor can be applied to the simulation of superpages. Thus, our vision for the future of hardware and architecture certainly includes AldernCapcase.

REFERENCES

1. Hameed Hussain, J., Sharavanan,

R., Floor cleaning machine by

remote control, International

Journal of Pure and Applied

Mathematics, V-116, I-14 Special

Issue, PP-461-464, 2017

2. Hameed Hussain, J., Srinivasan, V.,

Extraction of polythene waste from

domestic waste, International



Issue, PP-427-431, 2017

3. Hameed Hussain, J.,

Thirumavalavan, S., Flow analysis

of copper tube for solar trough

collector without joint, International



Issue, PP-541-544, 2017

4. Hanirex, D.K., Kaliyamurthie, K.P.,

Mining the financial multi-

relationship with accurate models,

Middle - East Journal of Scientific

Research, V-19, I-6, PP-795-798,

2014

5. Hemapriya, M., Meikandaan, T.P.,

Repair of damaged reinforced

concrete beam by externally bonded

with CFRP sheets, International



Issue, PP-473-479, 2017


Experimental study on changes in

properties of cement concrete using

steel slag and fly ash, International



Issue, PP-369-375, 2017


Experimental study on structural

repair and strengthening of RC

beams with FRP laminates,

International Journal of Pure and

Applied Mathematics, V-116, I-13

Special Issue, PP-355-361, 2017


Effect of high range water reducers

on sorptivity and water

permeability of concrete,





Strength and workability

characteristics of super plasticized

concrete, International Journal of

Pure and Applied Mathematics, V-

116, I-13 Special Issue, PP-345-

353, 2017


Potency and workability behavior

of quality plasticized structural

material, International Journal of



367, 2017

11. Hussain, J.H., Manavalan, S.,

Optimization of properties of

jatropha methyl Ester (JME) from

jatropha oil, International Journal of





484, 2017

12. Hussain, J.H., Manavalan, S.,

Optimization and comparison of

properties of neem and jatropha

biodiesels, International Journal of


116, I-17 Special Issue, PP-79-82,

2017

13. Hussain, J.H., Meenakshi, C.M.,

Simulation and analysis of heavy

vehicles composite leaf spring,




14. Hussain, J.H., Nimal, R.J.G.R.,

Review: Investigation on

mechanical properties of different

metal matrix composites in

diffusion bonding method by using

metal interlayers, International



Issue, PP-459-464, 2017

15. Jagadeeswari, P., Subashini, G.,

Basic results of probability,




16. Janani, V.D., Kavitha, S.,

Conceptual level similarity measure

based review spam detection

adversarial spam detection using

the randomized hough transform-

support vector machine,




17. Jasmin, M., Beulah Hemalatha, S.,

Security for industrial

communication system using

encryption / decryption modules,





VLSI-based frequency spectrum

analyzer for low area chip design

by using yasmirub method,





RFID security and privacy

enhancement, International Journal

of Pure and Applied Mathematics,

V-116, I-15 Special Issue, PP-535-

538, 2017


Digital phase locked loop,




21. Jeyalakshmi, G., Arulselvi, S.,

Community oriented configurations

for WSN, International Journal of



533, 2017


Investigating file systems,





Methodology for the development

of lambda calculus, International



Issue, PP-511-515, 2017


Remote procedure calls in access

points, International Journal of Pure

and Applied Mathematics, V-116,

I-15 Special Issue, PP-523-526,

2017

25. Jeyanthi Rebecca, L., Anbuselvi, S.,

Sharmila, S., Medok, P., Sarkar, D.,

Effect of marine waste on plant

growth, Der Pharmacia Lettre, V-7,

I-10, PP-299-301, 2015

26. Kaliyamurthie, K.P., Parameswari,

D., Udayakumar, R., Malicious

packet loss during routing

misbehavior-identification, Middle



- East Journal of Scientific

Research, V-20, I-11, PP-1413-

1416, 2014

27. Kanagavalli, G., Sangeetha, M.,

Intelligent traffic light system for

reduced fuel consumption,





GPS based blind pedestrian

positioning and voice response

system, International Journal of



484, 2017


Detection of retinal abnormality by

contrast enhancement

method using curvelet transform,





Design of low power VLSI circuits

for precharge logic, International



Issue, PP-505-509, 2017

31. Kanniga, E., Selvaramarathnam, K.,

Sundararajan, M., Kandigital bike

operating system, Middle - East

Journal of Scientific Research, V-

20, I-6, PP-685-688, 2014

32. Karthik, B., Arulselvi, Noise

removal using mixtures of projected

gaussian scale mixtures, Middle -

East Journal of Scientific Research,

V-20, I-12, PP-2335-2340, 2014

33. Karthik, B., Arulselvi, Selvaraj, A.,

Test data compression architecture

for low power VLSI testing, Middle -

East Journal of Scientific Research,

V-20, I-12, PP-2331-2334, 2014

34. Karthikeyan, R., Michael, G.,

Kumaravel, A., A housing selection

method for design,

implementation & evaluation

for web based recommended

systems, International Journal of


116, I-8 Special Issue, PP-23-27,

2017

35. Khanaa, V., Thooyamani, K.P.,

Using lookup table circulating

fluidised bed combustion boiler by

the method of sensor output

linearization, Middle - East Journal

of Scientific Research, V-16, I-12,

PP-1801-1806, 2013


Face routing protocol using genetic

algorithm in, Middle - East Journal

of Scientific Research, V-16, I-12,

PP-1863-1867, 2013


Udayakumar, R., Two factor

authentication using mobile phones,

World Applied Sciences Journal,

V-29, I-14, PP-208-213, 2014


Udayakumar, R., Next major wave

of it inovation, World Applied

Sciences Journal, V-29, I-14, PP-

218-220, 2014


Udayakumar, R., Traffic policing

approach for wireless video

conference traffic, World Applied

Sciences Journal, V-29, I-14, PP-

200-207, 2014


Udayakumar, R., Patient

monitoring in gene ontology with

words computing using SOM,


V-29, I-14, PP-195-199, 2014


Udayakumar, R., Load balancing in

structured PEER to PEER systems,


V-29, I-14, PP-186-189, 2014


Udayakumar, R., Impact of route

stability under random based

mobility model in MANET, World



Applied Sciences Journal, V-29, I-

14, PP-274-278, 2014


Udayakumar, R., Modelling Cloud

Storage, World Applied Sciences

Journal, V-29, I-14, PP-190-194,

2014


Udayakumar, R., Elliptic curve

cryptography using in multicast

network, World Applied Sciences

Journal, V-29, I-14, PP-264-269,

2014

45. Khanaa, V., Thooyamani, K.P.,

Udayakumar, R., SRW/U as a

lingua franca in managing the

diversified information resources,


V-29, I-14, PP-279-284, 2014

